Zero to Coke Hero: MSBA Capstone Spring 2024

Ian Donaldson, Michael Tom, Andrew Walton, Jake Jarrard

March 25, 2024

ML Introduction, Purpose, & Objectives

Project Goal:

Develop a framework for analyzing historical sales of Swire and competitor products, and from it a machine learning model that predicts and prescribes the best products to launch, expected demand, best time of the year, region, and/or optimal price.

Business Problem:

Swire Coca-Cola operates extensively, producing, selling, and distributing beverages across 13 states in the American West. The company is known for its regular introduction of new, often limited-time, products aimed at stimulating market demand. However, predicting accurate demand for these innovative products remains a significant challenge. Historical data provides some insight but is insufficient for precise forecasting due to variations in regional and demographic appeal.

The firm stands at the forefront of the Western US beverage distribution sector, continually launching limited edition products to maintain consumer interest and market dominance. Yet, the uncertainty in demand forecasting for these unique items presents risks of either overproduction or shortages, each carrying potential financial and reputational impacts. The project aims to leverage historical sales data to enhance the accuracy of demand predictions for future innovative offerings. In essence, can the data science team do a better job of predicting demand for new innovation products than a traditional market research survey, or the gut instinct of a seasoned sales manager?

Ultimately, the following questions should be answered for any products worth pursuing:

Product 1: Which 13 weeks of the year would this product perform best in the market? What is the forecasted demand, in weeks, for those 13 weeks?

Product 2: Swire plans to release this product 2 weeks prior to Easter and 2 weeks post Easter. What will the forecasted demand be, in weeks, for this product?

Product 3: Which 13 weeks of the year would this product perform best in the market? What is the forecasted demand, in weeks, for those 13 weeks?

Product 4: Swire plans to release this product for the duration of 1 year but only in the Northern region. What will the forecasted demand be, in weeks, for this product?

Product 5: Swire plans to release this product for 13 weeks, but only in one region. Which region would it perform best in?

Product 6: Swire plans to release this product for 6 months. What will the forecasted demand be, in weeks, for this product?

Product 7: Swire plans to release this product in the Southern region for 13 weeks. What will the forecasted demand be, in weeks, for this product?

Analytic Problems:

The analytics problem at hand involves developing a robust framework for analyzing historical sales data of Swire and competitor products. The primary objective is to construct a predictive machine learning model that forecasts sales, supplemented by models as simple as multiple linear regression for prescribing the optimal products to launch. This entails estimating expected demand, determining the best timing for launch, and recommending pricing strategies. A significant challenge lies in managing the variability in release durations of both historical and future products, which span from 13 weeks to six months. Because these are innovation products, many key features simply do not exist in the historical data, adding layers of complexity to the modeling problem.

Purpose of this Notebook:

The purpose of this notebook is to provide a high-level overview of the ML process and results for the Swire Capstone project. The project is focused on forecasting sales for new products and for existing products with new flavors. The question of which potential innovation items to model is broken down into 7 key sections. The goal is to provide a high-level overview of the results from the multiple linear regressions that served as the basis for recommending products for further modeling using XGBoost. The results from the XGBoost models will be used to provide further recommendations on which products to launch based on expected demand and pricing. By the end of this notebook, we will have a clear understanding of the potential products to launch and the expected sales for each product, while also answering the unique questions posed by Swire for each innovation.

Data Preparation & Feature Engineering

Given the data size (24M rows), where each row represents a transaction between a manufacturer of soft drinks and an unknown vendor in 13 western US states, the term “Data Mining” could not be more descriptive of the task at hand. While each row looks similar to the others, each real-world sales transaction captured within an observation is very different from its neighbors. Preparing the data required far more thought than an initial glance would suggest. We realized that the Swire data set was hundreds (maybe thousands) of data sets combined and aggregated into one massive 2D frame. With that, our first step was ensuring that the data was whole, devoid of errors and blanks - this was typically the case, and we only had to impute 0.2% of missing values, running simple text analytics on the data frame to ensure it would work with any ML algorithm. The data set provided was very well constructed and generally error-free.

Scaling the data down to a smaller, more manageable set of observations proved to be the first task, and it would be conducted differently for any given question. The outliers that we observed when sampling the data were “natural” outliers, and they exist throughout the data. We observed instances where distributors held firm competitive leads in different geographic areas (Market Keys) and, at times, made huge sales, captured in some rows, that threw off the average size of most transactions. These outliers, however, were not errors; they were simply a natural effect of doing business, necessary to moving inventory and generating profit. In several instances, we observed single cases of soft drinks sold at 100x below average price, as if a freebie or incentive was offered to move the product. In other cases, many rows reflected huge inventory sold at standard prices, with a 100x dollar sales value given the size of the deal. We observed, across subsections of soft drinks, that the Kansas market was an outlier in and of itself: 3 unique zip codes in 3 unique market keys sold thousands of products beyond the figures of much more significant market areas in heavily populated places such as Seattle, Washington. None of these interpretations could be understood without sampling the data along clearly segmented lines (by flavor, brand, manufacturer). None of the outliers in any sub-dataset proved to be erroneous, and we discarded no outliers, to accommodate the need for accuracy in our modeling.

A tedious element of data preparation was understanding the Market Keys and creating appropriate bins to capture them adequately. If market keys spanned multiple US states, we binned them into a general geographic region such as “NORTHERN” or “MOUNTAIN.” If other market keys were solidly inside a specific state and metro area, we assigned them to a more specific region, such as “COLORADO” or “NEW MEXICO.” The diversity of each market key was profound, with some market keys spanning huge rural areas and few towns and other market keys representing equally sized populations in three or fewer zip codes. There is far more to explore regarding market keys and geographic regions, including incorporating US Census data. Given the scope of the task, we did not incorporate census data into our calculations. However, we see it as an influential additive element for future, more specific targeting of products to local areas.

The ITEM column duplicated much of the data contained in other columns of the data frame. We spent much time culling information from each ITEM that was already captured in other variables within observations. As an example, “PACKAGE” was formally captured in a column within the entire dataset, but it was also represented in the ITEM column; so too “diet,” “enhanced water,” and “energy” were correctly flagged in each observation yet also noted within the ITEM string. Handling and removing redundancy proved essential, as the number of distinct ITEMs shrinks dramatically when these redundant elements are strategically excised from the ITEM column. By cleaning up the ITEM column, we reduced what seemed like an infinite number of products of a unique flavor down to a small handful of manageable ITEMs that could then be easily manipulated in an ML model. The PACKAGE column was of much interest, and ultimately, we broke down each PACKAGE description (separated by spaces) into a one-hot-encoded element for each observation. Doing so made it much easier to compare products and use our models to divine value from the variety of packaging that exists in combination. The importance of the information in the PACKAGE column cannot be overstated. Moreover, we caught ourselves saying that “Swire and competitors are mostly packaging and delivery companies.” Swire, as a client, proved keenly interested in “packaging” in its questions. Through our data preparation and subsequent feature engineering, each string and substring in any given row has a fuzzy albeit real numeric value and proved critical to our modeling; the key was removing redundant information in each row.
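As a minimal sketch of this one-hot-encoding step (the PACKAGE values and the PKG_ column prefix here are illustrative, not the actual Swire data), splitting each PACKAGE description on spaces and creating an indicator column per token looks like this in base R:

```r
# Illustrative sketch: one-hot encode space-separated PACKAGE tokens.
# The PACKAGE values and PKG_ prefix are assumptions for this example.
df <- data.frame(PACKAGE = c(".5L 12ONE JUG", "2L MULTI JUG"),
                 stringsAsFactors = FALSE)

tokens <- strsplit(df$PACKAGE, " ")      # split each description on spaces
vocab  <- sort(unique(unlist(tokens)))   # all distinct packaging tokens

# Build one indicator column per token
for (tok in vocab) {
  df[[paste0("PKG_", tok)]] <- as.integer(
    vapply(tokens, function(t) tok %in% t, logical(1))
  )
}

df[, c("PKG_JUG", "PKG_MULTI")]
# PKG_JUG is 1 for both rows; PKG_MULTI only for the second
```

The same loop scales to the full PACKAGE vocabulary, producing the wide, model-ready table described above.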

Modeling Process

Our modeling process went from simple and fun to complex and technical. First, we used elementary multiple linear regression models natively built into R to dive in and understand each of the seven tasks. Each team member utilized a standardized recipe to prepare data and then ran a GLM on three subsections of the data to answer each question. First, we sampled 10% of the data and filtered it by “BRAND” related to the innovation flavor. We also filtered by “flavor” to observe competitor products, estimate demand, and identify the coefficients of the product properties that drove demand for that particular flavor. We made data frames of individual flavors and brands to compare and contrast both, to see similarities and observe initial anecdotes of the market space where a specific innovation flavor could quickly appear. In some instances, it was clear that Swire concocted ideas that were easy wins. In other cases, some ideas were likely to fail based on a quick review of descriptive analytics and the output of a GLM. From this initial process, we focused on three specific innovation products worthy of deeper analysis. A note on our use of linear regression models: we only used them to gain immediate insight into our products; we did not use train/test splits to run predictive models to reinforce our decision making, but the output of these models gave us enough initial information to make simple decisions based on the client questions put forth.

We entertained some naive ideas, assuming that the data would lead us to answer complicated questions such as estimating demand for a product that never existed. We thought it possible (or easy) to create data for innovation products to be utilized as secondary test data to divine unit sales. We went so far as to generate synthetic data for extrapolating unit sales from elements that did not exist but were drawn from assumptions in the data. Given the disparity between existing model assumptions trained and tested on actual historical data and the distinct characteristics of innovation product scenarios, we soon faced almost insurmountable technical problems. This forced us to return to basics and lean on time-tested ML processes to become even more knowledgeable about our products and domain. This was a hard-earned lesson.

Our team relied on eXtreme Gradient Boosting (XGBOOST) due to its highly efficient and versatile implementation of gradient-boosted decision trees designed for speed and performance. Given its precision in making predictions, we decided to use this method, taking the assumptions of our initial linear models to the next level. The best way forward for us was to know with as much accuracy as possible what elements of products were more highly favored or influential than others in creating demand. We used XGBOOST to break down and evaluate the assumptions we made about each product with an extremely high level of accuracy. Knowing the elements of the innovative products, based on the significant factors that sell competitively similar products, would be the primary way to answer client questions without resorting to even more sophisticated methods to divine near-impossible answers. We engineered our datasets into one-hot-encoded, manageable, more comprehensive tables and ran them through our XGBOOST models. If the task at hand were a classic prediction problem, we would feel very optimistic about the quality of our models’ predictive power. Using this predictive power in a different way, we drew upon the accuracy and output of our XGBOOST models to bolster our assumptions and solutions in order to provide Swire with very solid answers. In this case, and to be sure our models could answer questions, we split all our data into train/test sets and employed 5-fold cross-validation to ensure we were not contaminating data. The results of the XGBOOST models almost always reinforced the initial information gleaned from the GLM models, and then added better information with high-quality predictive power to substantiate our claims.
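A minimal sketch of that train/test plus 5-fold cross-validation workflow, assuming the xgboost package and using a small synthetic frame in place of the engineered Swire tables (the feature names here are illustrative):

```r
# Sketch only: synthetic stand-in data, not the Swire dataset.
library(xgboost)

set.seed(123)
n <- 500
X <- matrix(rnorm(n * 3), ncol = 3,
            dimnames = list(NULL, c("PRICE", "PKG_JUG", "WEEK")))
y <- 2 * X[, "PRICE"] + rnorm(n)   # stand-in for UNIT_SALES

idx    <- sample(n, 0.8 * n)       # 80/20 train/test split
dtrain <- xgb.DMatrix(X[idx, ],  label = y[idx])
dtest  <- xgb.DMatrix(X[-idx, ], label = y[-idx])

params <- list(objective = "reg:squarederror", eta = 0.1)

# 5-fold cross-validation to choose the number of boosting rounds
cv <- xgb.cv(params = params, data = dtrain, nrounds = 200, nfold = 5,
             early_stopping_rounds = 10, verbose = 0)

# Refit on the full training split and score the held-out test split
fit  <- xgb.train(params = params, data = dtrain,
                  nrounds = cv$best_iteration)
pred <- predict(fit, dtest)
rmse <- sqrt(mean((pred - y[-idx])^2))
```

Cross-validating on the training split only, then scoring once on the untouched test split, is what keeps the evaluation uncontaminated.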

We employed the XGBOOST model on innovation products that had more in common with similar historical products. We did not employ XGBOOST on innovation products whose properties, upon initial review, were more disparate from the historical products among which they could potentially be grouped. That is not to say we underestimate the effectiveness of a totally new and unique package or flavor, but given that we can only baseline customer expectations on past behavior, our objective was to identify innovation products that showed more consistency with neighboring products.

Model Performance

Again, if this were a classic prediction problem, where we had to estimate the unit sales of a string of transactions, we could almost guarantee excellent results. Multiple iterations of our XGBOOST models on training and testing data consistently returned R-squared and adjusted R-squared values in the high .90s. Our models demonstrated remarkable predictive accuracy, achieving an R-squared and adjusted R-squared of around 0.95 in many instances, indicating that they successfully captured most of the variance in our engineered dataset. While there are slight variations in MAE, RMSE, and MAPE between the training and testing phases, these discrepancies highlight the model’s nuanced understanding of the data rather than significant shortcomings. In this case, we are highly confident in the claims, suggestions, and recommendations we make about the historic (real) products relative to innovation products for the benefit of Swire. That said, we owe almost all of our model performance to the very high correlation of UNIT_SALES with DOLLAR_SALES - these two variables are obviously very powerful together. When we remove DOLLAR_SALES from our models, the R-squared drops to just above .50 for training and testing. However, what we are left with very effectively identifies the additional properties of soft drink sales that are independent of sales price.
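For reference, the metrics cited here (R-squared, MAE, RMSE, MAPE) can all be computed directly from a vector of predictions and actuals; the toy values below are illustrative, not our model output:

```r
# Toy predicted/actual vectors; not actual model output
actual <- c(100, 150, 200, 250, 300)
pred   <- c(110, 140, 210, 240, 310)

rss  <- sum((actual - pred)^2)           # residual sum of squares
tss  <- sum((actual - mean(actual))^2)   # total sum of squares
r2   <- 1 - rss / tss                    # R-squared: 0.98
mae  <- mean(abs(actual - pred))         # mean absolute error: 10
rmse <- sqrt(mean((actual - pred)^2))    # root mean squared error: 10
mape <- mean(abs((actual - pred) / actual)) * 100  # percent error: 5.8
```

Comparing these four between the train and test splits is how the train/test discrepancies above were assessed.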

We admit that the quality of our model doesn’t answer specific questions directly; there is a great deal of interpretation and insight we must provide to the client in order to hit home runs. We also attribute its predictive power to the stability and reliability of the Swire dataset. So what we can say is that the data is good, even if it has been aggregated together and some of the everyday elements of sales transactions and the consumption habits of consumers are long lost. Our models also explain only part of the final recommendations that we will provide to the client, Swire. The modeling that follows essentially gives us the minimal context to begin to argue for our recommendations.

A note on DEMAND: Across this effort, “demand” is conceptualized not as a direct measure of consumer interest, but as a variable that varies significantly across different products and categories. The use of “unit sales” as a proxy for demand is inadequate due to the presence of aggregated sales agreements between manufacturers and retailers, which can distort actual demand metrics. Therefore, the determination of demand will be approached through alternative prescriptive analytics methodologies, separate from the current machine learning framework, to address its multifaceted nature effectively.

Conclusion - Results & Recommendations

It is obvious that Swire’s data science team, our client, hopes to get ahead of its marketing department’s costly and laborious market surveys and product testing efforts related to innovation products. Predicting consumer interest in a product cannot be done in a vacuum. However, we, like our clients at Swire, are believers in data science and machine learning, and we believe that, like a well-crafted algorithm, we can get very good at predicting what may be successful according to historic data, even if that data is an amalgamation of predecessor products or based on elements of past success.

For our presentation to Swire, we will present a clear evaluation of 2 or 3 of the questions asked of us. We will draw on the power of the following effort and lessons learned to create a demand forecast from pricing optimization strategies (not illustrated in this notebook). What we present will be drawn from conclusions made in this notebook from the exploration of innovation ideas and historical neighbors.

Ultimately, our goal is to create an innovation data set, something that represents and resembles the essential elements of a successful historical product without violating best practices in data science and ML modeling. We are also cautious in making statements that we cannot back up with data and the results of our models. Once we have that stable footing, our team will use prescriptive analytics to make suggestions for our client that can get well ahead of market surveys and divining consumer interest through taste tests at state fairs. Whether we can create that data set remains to be seen, but we are confident we are much of the way there, and we can, after this modeling exercise, answer most of our client’s questions without making up statistics on the spot.

SECTION SUMMARIES

We have provided the following modeling in several sections, each focusing on one of the seven potential Swire innovation products. These aim to provide the high-level initial results from the multiple linear regressions that served as the basis for recommending products to model using XGBoost. Summaries of each follow. Some comments are provided in individual sections when necessary. Most comments are captured inside the R code output. Most code has been suppressed, mainly because of its lengthy output. We have generally included the exploration of each proposed item using a GLM, mostly for our own edification and education. From there we chose three products we felt had, initially, the most promise to pursue further. We rallied around those three products, using XGBoost models to get deep into their properties and provide our assessments and recommendations.

PRODUCT 1: Diet Smash Plum 11Small 4One

Analysis: There are two innovation aspects involved: a new package for Diet Smash and the addition of an existing flavor (Plum) that has never been part of the Diet Smash line-up of drinks. The linear regressions indicate high potential for forecasting, especially for the small Plum multiple linear regressions. Diet Smash shows significance in terms of seasonality, particularly in Summer and Winter. However, Plum by itself is on the edge of not being significant for Summer and Winter. Packaging alone is not as strong for Plum and Diet Smash. Diet Smash comes in two regular types with one size that ran for four weeks. The regression on the innovation data frame suggests PACKAGE12SMALL 24ONE CUP as the configuration that sold the most amongst its standard products. Considering the analysis, we do not recommend further pursuit of Diet Smash and Plum compared to other opportunities. It appears that while there is potential for forecasting, the significance of Plum in specific seasons is not entirely clear. Additionally, the innovative packaging proposed doesn’t seem to provide a strong advantage for the product. Further analysis may be needed to determine the viability of launching this product, particularly in terms of seasonal performance and packaging effectiveness.

MODEL Employed: Multiple Linear Regression

RECOMMENDATION: This innovation product concept has more disparate properties than (has less in common with) historical products. As a result, it may prove more difficult to predict sales for, or to model as well as, the other innovation concepts in this notebook effort.

PRODUCT 2: Sparkling Jacceptabletlester Avocado 11Small MLT

Based on this brand, Sparkling Jacceptabletlester does not traditionally sell well across all variables. It appears to do better when it is REGULAR as opposed to DIET. Its 1.25L MULTI JUG, PACKAGE20SMALL MULTI JUG, PACKAGE2L MULTI JUG & PACKAGE1L MULTI JUG sell better than most other packaging. Only some of its smaller packaging - PACKAGE12SMALL 6ONE CUP, PACKAGE12SMALL 8ONE SHADYES JUG, PACKAGE7.5SMALL 8ONE CUP - are okay sellers, but not as strong as the larger sizes. SparklingJ does not do particularly well in April of any given year. We see a winter uptick; it is not a hot summer seller based on looking at season and month. SparklingJ has a very small window of UNIT_SALES (<3000) to DOLLAR_SALES (<5000). It does better in the Colorado and Northern regions. Avocado is a popular flavor, representing more than 7% of all flavored soft drinks in the market (~1.8M sales observations). Avocado flavored drinks do not fare as well as SSD (Standard Soda Pop).

MODEL Employed: Multiple Linear Regression

RECOMMENDATION: Based on the client questions asked, this innovation product may be the least similar to its would-be competitors already in the market. People would find it a little out of place. The sales of SparklingJ have never been as strong as others. Although the sales of Avocado flavors are solid, this innovation product would possibly stand out from others, and potentially not in a good way. This product would likely need some market research, something we are hoping to avoid.

PRODUCT 3: Diet Venomous Blast Energy Drink Kiwano 16 Liquid Small

The dataset for Kiwano and energy drinks is limited, with very few rows. Despite this limitation, there is still potential to develop a model that can predict launch sales. There exists an argument to create a model that predicts the sales of units of energy drinks, specifically with a size of 16 and kiwano flavor. This model could potentially be used in conjunction with the current sales rate of VENOMUS BLAST launches to provide an accurate forecast. Determining the optimal weeks for selling the product is a challenge. Historical best 13 weeks sales of either VENOMUS BLAST, energy drinks, or kiwano flavored drinks could be used as a reference for identifying potential sales periods. Given the limited data available for Kiwano and energy drinks, it is essential to approach the recommendation cautiously. Additionally, determining the best weeks for selling the product relies heavily on historical sales data, which may not accurately reflect future market conditions.

In exploring and modeling for Kiwano, we found there is a subset of the data that proves to be explanatory and has some potential for prediction. In our modeling, a subset of the data was selected by filtering products that matched category energy, item containing “Kiwano,” and size containing “16.” Then, for our specific questions of when to launch and how much demand to expect, we added in week of the year and weeks since launch. With our data set we created an XGBoost model that showed some predictive power. Finally, we created a dummy set of data and had our model create predictions for every combination of our features to help predict future sales.
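A sketch of that subsetting and dummy-grid step, using a tiny illustrative frame in place of the real data (the toy rows and the WEEK_OF_YEAR / WEEKS_SINCE_LAUNCH names are stand-ins, not the actual columns):

```r
# Sketch only: toy rows standing in for the Swire transaction data.
library(dplyr)

df <- data.frame(
  CATEGORY = c("ENERGY", "ENERGY", "SSD"),
  ITEM     = c("VENOM KIWANO 16SMALL", "VENOM PLUM 16SMALL", "COLA 2L"),
  DATE     = as.Date(c("2023-01-09", "2023-01-16", "2023-01-09")),
  stringsAsFactors = FALSE
)

# Filter to energy-category items containing "KIWANO" and size "16",
# then derive the week-of-year feature from the transaction date
kiwano <- df %>%
  filter(CATEGORY == "ENERGY",
         grepl("KIWANO", ITEM),
         grepl("16", ITEM)) %>%
  mutate(WEEK_OF_YEAR = as.integer(format(DATE, "%V")))

# Dummy grid: every combination of week-of-year and weeks-since-launch,
# each row to be scored by the fitted model to map out future sales
grid <- expand.grid(WEEK_OF_YEAR = 1:52, WEEKS_SINCE_LAUNCH = 1:13)
nrow(grid)  # 676 feature combinations
```

Scoring every grid row with the fitted model is what surfaces the best launch window described in the recommendation below.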

MODELS Employed: Multiple Linear Regression, XGBOOST

RECOMMENDATION: Overall, the unique qualities of Kiwano created a path where we were able to make some predictions on what window of time would be best to launch and take a first step at unit forecasting by week. We recommend this flavor sensation for additional study to go to market.

PRODUCT 4: Diet Square Mulberries Sparkling Water 10Small MLT

We do not think that we should recommend this product. We have very few observations of our Square brand, and only 19 of those reside in the Northern Region. When widening the scope to look at other factors such as the flavor, category, and region as a whole, it continues to not make sense to sell this product for a whole year in the Northern Region. The Northern Region consistently demonstrated a negative relationship with total dollar sales in comparison with the other regions we reviewed. The Square brand has been primarily sold in California and the Southwest so far, and we don’t have sufficient data to say that this brand would do better than the other sparkling waters that have not done well in the Northern Region. From a time perspective, it also does not make sense to sell this for the whole year. The spring and especially the fall do far better than the winter in all of the regression models we ran. We wouldn’t want to recommend selling it during months where it would not perform well. From a flavor and category perspective, we were able to gain more insights, such as the best type of packaging associated with the flavor, but the Northern Region continued to perform poorly with the flavor mulberries.

MODEL Employed: Multiple Linear Regression

RECOMMENDATION: We would not recommend moving forward with this product due to the small amount of historical data that we have with the brand and flavor and the poor performance of the region with similar products.

PRODUCT 5: Greetingle Health Beverage Woodsy Yellow .5L 12One Jug

“GREETINGLE Woodsy Yellow” appears to be a good opportunity for further pursuit based on available data. There are only two brands selling three Woodsy Yellow flavored products. The client asked, “Swire plans to release this product for 13 weeks, but only in one region. Where would it perform best?” We unequivocally state that the Kansas market (East Kansas) is an excellent choice. And while this may be due to skewed sales represented in the East Kansas region, it would also do well in the Northern Region (East, NE Idaho, Oregon, and Washington). The packaging as proposed would not be a stretch to conceive and sell. Greetingle as a brand has seen relatively good dollar sales even with smaller volumes of unit sales. And it could use some company: there is really only one product on the market worth pursuing as a benchmark for success - “TITANS NOURISH WATER BEVERAGE WOODSY YELLOW” presents as the leading competitor. These factors led us to conduct deeper analysis of Greetingle Woodsy Yellow as a good candidate for a small-batch, limited-market product.

MODELS Employed: Multiple Linear Regression, XGBOOST

RECOMMENDATION: Recommend further study for a limited release product to select markets.

PRODUCT 6: Diet Energy Moonlit Casava 2L Multi Jug

“DIET MOONLIT” demonstrates decent sales performance, ranking 69th in total revenue out of 288. This indicates a moderate level of success in the market for Diet Moonlit alone. Our modeling of soft drinks with Casava, particularly coupled with the Energy category, returned good coefficients and modeling results, suggesting that there is good potential for this flavor variant, particularly in a highly caffeinated situation. The analysis identifies some 21-week stretches that appear promising for making six-month predictions. However, the challenge lies in selecting which weeks to use for these predictions. Despite Casava not being considered a top-tier flavor option, the analysis suggests potential for “DIET MOONLIT,” given its decent sales performance and good results when we run both GLM and XGBOOST models on the Casava “innovation” data frame. While Diet Moonlit has 2L Multi JUG success, there is no combination using that package format for Casava and Energy. As Swire has success in the 2L Multi JUG arena for Diet Moonlit, there is potential for a successful product even if Casava and Energy are not historically found in that package format. Casava and Energy are historically only found in a regular caloric segment; however, combining the high predictability of Casava Energy and the high sales of Diet Moonlit, there is potential for a successful product. With identified promising weeks for predictions, there is an opportunity to leverage this information to further enhance sales forecasting and strategic decision-making.

MODELS Employed: Multiple Linear Regression, XGBOOST

RECOMMENDATION: Based on the provided information, we recommend continuing work toward bringing Diet Moonlit Casava Energy to market.

PRODUCT 7: Peppy Gentle Drink Pink Woodsy .5L Multi Jug

Through our analysis of a ‘Pink Woodsy’ flavored launch, there was very little evidence that further modeling would create a reliable prediction, as the historical data is missing many of the features we would need in order to explain variation. A few of the features missing for this specific innovative product are: A. Comparable flavors - though there have been products in the past with Pink or Woodsy, there have never been any items with this combination. B. Innovation data for the brand ‘Peppy’ - in our research of the brand, we found it does not have any innovation data that would give us indications of how a new product would compete if launched. C. A definition of which regions or areas would be considered ‘South’ for this launch.

MODEL Employed: Multiple Linear Regression

RECOMMENDATION: With these crucial factors either excluded from modeling or replaced by best estimates based on the ‘closest’ items, we do not believe moving forward with a prediction for this product would be advised. With a product such as this, any type of trial data or direction on which items would be most comparable would help assure accuracy.

Selected Innovation Products

3 of the 7 potential innovation products were selected and recommended for further modeling. The XGBoost models were run on these products and the results are summarized below. The four other models are not included, for brevity and because our methods and processes are generally shown in the linear regression review sections below.

Innovation Model 1 - Woodsy Yellow

Item Description: Greetingle Health Beverage Woodsy Yellow .5L 12One Jug

a. Caloric Segment: Regular
b. Market Category: ING Enhanced Water
c. Manufacturer: Swire-CC
d. Brand: Greetingle
e. Package Type: .5L 12One Jug
f. Flavor: ‘Woodsy Yellow’

Q. Swire plans to release this product for 13 weeks, but only in one region. Which region would it perform best in?

Linear Regression Review

df <- readRDS("swire_no_nas.rds")  #load in EDA cleaned data (no nas)
#fixes the market keys to regional areas... 

regions_joinme <- read.csv("states_summary.csv")

unique(regions_joinme$REGION)
##  [1] "NORTHERN"    "DESERT_SW"   "PRAIRIE"     "CALI_NEVADA" "MOUNTAIN"   
##  [6] "SOCAL"       "ARIZONA"     "NEWMEXICO"   "NOCAL"       "COLORADO"   
## [11] "KANSAS"

str(regions_joinme)
## 'data.frame':    200 obs. of  2 variables:
##  $ MARKET_KEY: int  13 70 179 197 272 352 32 33 44 50 ...
##  $ REGION    : chr  "NORTHERN" "NORTHERN" "DESERT_SW" "DESERT_SW" ...
# Perform a left join using the merge() function
df <- merge(df, regions_joinme[, c("MARKET_KEY", "REGION")], by = "MARKET_KEY", all.x = TRUE)
rm(regions_joinme)
# Update CALORIC_SEGMENT values: 0 if 'DIET/LIGHT', otherwise 1
df$CALORIC_SEGMENT <- ifelse(df$CALORIC_SEGMENT == "DIET/LIGHT", 0, 1)
df$MARKET_KEY <- as.character(df$MARKET_KEY)
df <- df %>%
  mutate(
    MONTH = as.numeric(substr(DATE, 6, 7)),  # Extract the month from YYYY-MM-DD format
    SEASON = case_when(
      MONTH %in% c(12, 01, 02) ~ "WINTER",
      MONTH %in% c(03, 04, 05) ~ "SPRING",
      MONTH %in% c(06, 07, 08) ~ "SUMMER",
      MONTH %in% c(09, 10, 11) ~ "FALL",
      TRUE ~ NA_character_  # This is just in case there are any undefined values
    )
  )
# Take a 10% sample of the data to shrink it (2,446,143 of ~24.5M rows)
set.seed(123) # Set a random seed for reproducibility
df <- df[sample(1:nrow(df), 2446143), ]  # keep the sample as df so downstream code uses it
#skim(df)
summary(df)
##   MARKET_KEY            DATE           CALORIC_SEGMENT    CATEGORY        
##  Length:2446143     Length:2446143     Min.   :0.0000   Length:2446143    
##  Class :character   Class :character   1st Qu.:0.0000   Class :character  
##  Mode  :character   Mode  :character   Median :1.0000   Mode  :character  
##                                        Mean   :0.5025                     
##                                        3rd Qu.:1.0000                     
##                                        Max.   :1.0000                     
##    UNIT_SALES        DOLLAR_SALES      MANUFACTURER          BRAND          
##  Min.   :    0.04   Min.   :     0.0   Length:2446143     Length:2446143    
##  1st Qu.:   11.00   1st Qu.:    36.5   Class :character   Class :character  
##  Median :   40.00   Median :   135.1   Mode  :character   Mode  :character  
##  Mean   :  173.43   Mean   :   587.4                                        
##  3rd Qu.:  126.00   3rd Qu.:   427.4                                        
##  Max.   :91778.00   Max.   :409159.3                                        
##    PACKAGE              ITEM              REGION              MONTH       
##  Length:2446143     Length:2446143     Length:2446143     Min.   : 1.000  
##  Class :character   Class :character   Class :character   1st Qu.: 3.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 6.283  
##                                                           3rd Qu.: 9.000  
##                                                           Max.   :12.000  
##     SEASON         
##  Length:2446143    
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
# Create a scatter plot with the regression line, colored by MANUFACTURER
# Gives a good overview of the data sampled 

ggplot(df, aes(x = UNIT_SALES, y = DOLLAR_SALES, color = MANUFACTURER)) +
  geom_point(alpha = 0.5) +  # Adjust alpha to avoid overplotting, if necessary
  geom_smooth(method = "lm", color = "black", se = FALSE) +  # Add linear regression line
  labs(title = "Linear Model of UNIT_SALES vs. DOLLAR_SALES by MANUFACTURER",
       x = "UNIT SALES",
       y = "DOLLAR SALES") +
  theme_minimal() +
  theme(legend.position = "bottom")  # Adjust legend position if needed
## `geom_smooth()` using formula = 'y ~ x'

# create a table of total values by brand
brand_summary <- df %>%
  group_by(BRAND) %>%
  summarise(
    total_units_sold = sum(UNIT_SALES),
    total_revenue = sum(DOLLAR_SALES),
    avg_price = total_revenue / total_units_sold,
    total_days_sold = n() # Count the number of rows for each brand
  ) %>%
  arrange(desc(total_revenue)) %>%  # Order by revenue in descending order
  mutate(rank = row_number()) 

summary(brand_summary)
##     BRAND           total_units_sold   total_revenue         avg_price      
##  Length:288         Min.   :       1   Min.   :        1   Min.   : 0.5315  
##  Class :character   1st Qu.:    2310   1st Qu.:     7563   1st Qu.: 2.0861  
##  Mode  :character   Median :   94691   Median :   266075   Median : 3.0291  
##                     Mean   : 1473003   Mean   :  4989427   Mean   : 3.2661  
##                     3rd Qu.:  651385   3rd Qu.:  2161764   3rd Qu.: 3.7252  
##                     Max.   :40414038   Max.   :159387186   Max.   :42.9378  
##  total_days_sold         rank       
##  Min.   :     1.0   Min.   :  1.00  
##  1st Qu.:   121.8   1st Qu.: 72.75  
##  Median :  1988.0   Median :144.50  
##  Mean   :  8493.5   Mean   :144.50  
##  3rd Qu.:  8075.8   3rd Qu.:216.25  
##  Max.   :124603.0   Max.   :288.00
print(brand_summary[brand_summary$BRAND == "GREETINGLE", ])
## # A tibble: 1 × 6
##   BRAND      total_units_sold total_revenue avg_price total_days_sold  rank
##   <chr>                 <dbl>         <dbl>     <dbl>           <int> <int>
## 1 GREETINGLE         2170595.      5740637.      2.64           48787    47

GREETINGLE is a strong brand, ranking 47th out of 288 brands in total revenue, with an average price of $2.64, slightly below the overall mean of $3.27.

# Filter the dataframe for only 'GREETINGLE'
filtered_df <- df %>% 
  filter(BRAND == "GREETINGLE")

# Create the plot
ggplot(filtered_df, aes(x = UNIT_SALES, y = DOLLAR_SALES)) +
  geom_point(color = "red", alpha = 1) +  # Bright red points with full opacity
  geom_smooth(method = "lm", color = "black", se = FALSE) +  # Add linear regression line without confidence band
  labs(title = "Linear Model of UNIT_SALES vs. DOLLAR_SALES for GREETINGLE",
       x = "UNIT SALES",
       y = "DOLLAR SALES") +
  theme_minimal() +
  theme(legend.position = "none")  
## `geom_smooth()` using formula = 'y ~ x'

GREETINGLE shows a wide range of groupings; one unusually steep grouping shows higher revenue at lower unit volumes, most likely reflecting higher-priced packages or deals made outside the norm.

# Sales by Week of the Year

filtered_df %>%
  mutate(DATE = as.Date(DATE)) %>%
  mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
  group_by(WEEK) %>%
  summarise(total_sales = sum(UNIT_SALES)) %>%
  ggplot(aes(x = WEEK, y = total_sales)) +
  geom_line(color = "black") +  # Blue line connecting points
  labs(title = "Total Sales by Week of the Year",
       x = "Week of the Year",
       y = "Total Unit Sales") +
  theme_minimal()

> GREETINGLE sells solidly from early spring (week 10) through mid-summer (week 30), then begins a slow decline through the latter part of the year.

library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
# Calculate total sales for each group of 21 consecutive weeks
sales_by_group <- filtered_df %>%
  mutate(DATE = as.Date(DATE)) %>%
  mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
  group_by(WEEK) %>%
  summarise(total_sales = sum(UNIT_SALES)) %>%
  mutate(sales_in_group = rollsum(total_sales, 21, align = "left", fill = NA)) %>%
  mutate(week_label = paste0("Week ", WEEK, " to Week ", WEEK + 20)) %>%  # left-aligned window covers WEEK..WEEK+20
  arrange(WEEK) %>%  # Order by WEEK
  filter(!is.na(sales_in_group))  # Remove rows with sales_in_group = NA

# Plot the bar chart
sales_by_group$week_label <- factor(sales_by_group$week_label, levels = sales_by_group$week_label[order(sales_by_group$WEEK)])
ggplot(sales_by_group, aes(x = factor(week_label), y = sales_in_group)) +
  geom_bar(stat = "identity", fill = "black") +
  labs(title = "Total Sales for Each 21-Week Grouping",
       x = "Weeks (Starting from Week 1)",
       y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

> GREETINGLE shows a similar pattern across the weekly groupings: solid sales from early spring through mid-summer, followed by a slow decline to the end of the year.

#find the best 21 weeks for Woodsy sales - WOODSY YELLOW has a space in the middle
# Calculate total sales for each group of 21 consecutive weeks
sales_by_woodsy <- df %>%
  filter(str_detect(ITEM, "WOODSY  YELLOW")) %>%
  mutate(DATE = as.Date(DATE)) %>%
  mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
  group_by(WEEK) %>%
  summarise(total_sales = sum(UNIT_SALES)) %>%
  mutate(sales_in_group = rollsum(total_sales, 21, align = "left", fill = NA)) %>%
  mutate(week_label = paste0("Week ", WEEK, " to Week ", WEEK + 20)) %>%  # left-aligned window covers WEEK..WEEK+20
  arrange(WEEK) %>%  # Order by WEEK
  filter(!is.na(sales_in_group))  # Remove rows with sales_in_group = NA

# Plot the bar chart
sales_by_woodsy$week_label <- factor(sales_by_woodsy$week_label, levels = sales_by_woodsy$week_label[order(sales_by_woodsy$WEEK)])
ggplot(sales_by_woodsy, aes(x = factor(week_label), y = sales_in_group)) +
  geom_bar(stat = "identity", fill = "black") +
  labs(title = "Total Sales for Each 21-Week Grouping",
       x = "Weeks (Starting from Week 1)",
       y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

> Woodsy Yellow sales are best in the 21 weeks from week 14 to 34.
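As a sanity check on how these rolling windows are formed, here is a minimal, self-contained sketch of `zoo::rollsum()` with left alignment (the numbers are toy values, not Swire data):

```r
library(zoo)

# Toy weekly totals (illustrative values only)
weekly <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)

# Left-aligned rolling sum: the value at position k sums positions k..k+3,
# so a 21-week window starting at week k covers weeks k through k+20.
roll4 <- rollsum(weekly, 4, align = "left", fill = NA)

roll4[1]  # 10 + 20 + 30 + 40 = 100
roll4[7]  # 70 + 80 + 90 + 100 = 340; positions 8-10 are NA (incomplete windows)
```

Filtering out the `NA` entries, as the chunks above do, simply drops the trailing windows that would run past the end of the year rather than wrapping around.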

#find the best 13 weeks for ING Enhanced Water, WOODSY YELLOW
# Calculate total sales for each group of 13 consecutive weeks

sales_by_innovation <- df %>%
  filter(CATEGORY == "ING ENHANCED WATER",
         str_detect(ITEM, "WOODSY"),
         str_detect(ITEM, "YELLOW")) %>%
  mutate(DATE = as.Date(DATE)) %>%
  mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
  group_by(WEEK) %>%
  summarise(total_sales = sum(UNIT_SALES)) %>%
  mutate(sales_in_group = rollsum(total_sales, 13, align = "left", fill = NA)) %>%
  mutate(week_label = paste0("Week ", WEEK, " to Week ", WEEK + 12)) %>%  # left-aligned window covers WEEK..WEEK+12
  arrange(WEEK) %>%  # Order by WEEK
  filter(!is.na(sales_in_group))  # Remove rows with sales_in_group = NA

# Plot the bar chart
sales_by_innovation$week_label <- factor(sales_by_innovation$week_label, levels = sales_by_innovation$week_label[order(sales_by_innovation$WEEK)])
ggplot(sales_by_innovation, aes(x = factor(week_label), y = sales_in_group)) +
  geom_bar(stat = "identity", fill = "black") +
  labs(title = "Total Sales for Each 13-Week Grouping",
       x = "Weeks (Starting from Week 1)",
       y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

Taken together, the plots above show that most of the year performs well: sales taper slowly toward the end of the year, pick back up in early spring, and hold solid through mid-summer.
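The bar charts above are read off visually, but the same answer can be pulled out programmatically by taking the rolling sums and picking the window with the largest total. A minimal sketch on toy data (a 4-week window stands in for the report's 13- and 21-week windows; the sales figures are invented):

```r
library(zoo)

# Invented weekly unit sales with a mid-year peak
weekly_sales <- c(5, 8, 12, 20, 25, 30, 28, 22, 15, 10, 7, 6)

window <- 4  # 13 or 21 in the report
sums <- rollsum(weekly_sales, window, align = "left", fill = NA)

best_start <- which.max(sums)  # which.max skips the trailing NAs
best_start  # window starting at week 5 (weeks 5-8) has the largest total here
```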

# Making a new smaller "innovation" data frame
GREETINGLE_DF <- filtered_df
#create innovation based on ING ENHANCED WATER
innovation<- df %>%
  filter(CATEGORY == "ING ENHANCED WATER",
         str_detect(ITEM, "WOODSY"),
         str_detect(ITEM, "YELLOW"))


#unique PACKAGE string from innovation
print(unique(innovation$PACKAGE))
## [1] ".5L 6ONE JUG"      "20SMALL MULTI JUG" "26-32SMALL MLT"   
## [4] ".5L MULTI JUG"
library(dplyr)
library(lubridate)

innovation <- innovation %>%
  mutate(
    MONTH = month(ymd(DATE)),  # Extract month using lubridate's ymd function
    MONTH = as.factor(MONTH)   # Convert the extracted month into a factor
  )

str(innovation)
## 'data.frame':    9642 obs. of  13 variables:
##  $ MARKET_KEY     : chr  "969" "137" "1135" "924" ...
##  $ DATE           : chr  "2022-05-14" "2021-01-09" "2022-04-16" "2021-07-03" ...
##  $ CALORIC_SEGMENT: num  0 1 0 1 0 0 1 1 1 0 ...
##  $ CATEGORY       : chr  "ING ENHANCED WATER" "ING ENHANCED WATER" "ING ENHANCED WATER" "ING ENHANCED WATER" ...
##  $ UNIT_SALES     : num  2 88 100 25 30 7 155 127 37 5 ...
##  $ DOLLAR_SALES   : num  8.98 102.28 107.17 44.71 118.26 ...
##  $ MANUFACTURER   : chr  "COCOS" "COCOS" "COCOS" "COCOS" ...
##  $ BRAND          : chr  "STRONGLY ENERGY WATER" "VITAMINAL FLOW" "STRONGLY ENERGY WATER" "VITAMINAL FLOW" ...
##  $ PACKAGE        : chr  ".5L 6ONE JUG" "20SMALL MULTI JUG" "20SMALL MULTI JUG" "26-32SMALL MLT" ...
##  $ ITEM           : chr  "STRONGLY ENERGY WATER IONIC WATER BEVERAGE WOODSY  YELLOW  JUG 16.9 LIQUID SMALL X6" "TITANS NOURISH WATER WATER BEVERAGE WOODSY  YELLOW  JUG 20 LIQUID SMALL" "STRONGLY ENERGY WATER IONIC WATER BEVERAGE WOODSY  YELLOW  JUG 20 LIQUID SMALL" "TITANS NOURISH WATER NUTRIENT ENHANCED WATER BVRG WOODSY  YELLOW  JUG 32 LIQUID SMALL" ...
##  $ REGION         : chr  "SOCAL" "CALI_NEVADA" "PRAIRIE" "COLORADO" ...
##  $ MONTH          : Factor w/ 12 levels "1","2","3","4",..: 5 1 4 7 4 8 6 2 8 10 ...
##  $ SEASON         : chr  "SPRING" "WINTER" "SPRING" "SUMMER" ...
print(unique(innovation$ITEM))
## [1] "STRONGLY ENERGY WATER IONIC WATER BEVERAGE WOODSY  YELLOW  JUG 16.9 LIQUID SMALL X6"  
## [2] "TITANS NOURISH WATER WATER BEVERAGE WOODSY  YELLOW  JUG 20 LIQUID SMALL"              
## [3] "STRONGLY ENERGY WATER IONIC WATER BEVERAGE WOODSY  YELLOW  JUG 20 LIQUID SMALL"       
## [4] "TITANS NOURISH WATER NUTRIENT ENHANCED WATER BVRG WOODSY  YELLOW  JUG 32 LIQUID SMALL"
## [5] "STRONGLY ENERGY WATER IONIC WATER BEVERAGE WOODSY  YELLOW  JUG 16.9 LIQUID SMALL"
# Count the occurrences of each unique PACKAGE value in our sample
table(innovation$PACKAGE)
## 
##      .5L 6ONE JUG     .5L MULTI JUG 20SMALL MULTI JUG    26-32SMALL MLT 
##              1782                 5              4926              2929
#This gives us an idea that there are only 4 package combinations with 7 sub-string elements
# Looking at the competitors within the 'innovation' data frame
model <- lm(UNIT_SALES ~ DOLLAR_SALES + CALORIC_SEGMENT + PACKAGE + SEASON + REGION, data = innovation)
summary(model)
## 
## Call:
## lm(formula = UNIT_SALES ~ DOLLAR_SALES + CALORIC_SEGMENT + PACKAGE + 
##     SEASON + REGION, data = innovation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -581.68  -11.32   -0.20    9.87  299.96 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -19.945521   0.932885 -21.380  < 2e-16 ***
## DOLLAR_SALES               0.582900   0.002442 238.652  < 2e-16 ***
## CALORIC_SEGMENT            9.483530   0.646691  14.665  < 2e-16 ***
## PACKAGE.5L MULTI JUG      15.711797  11.801624   1.331 0.183114    
## PACKAGE20SMALL MULTI JUG  30.867445   0.824520  37.437  < 2e-16 ***
## PACKAGE26-32SMALL MLT     -5.117949   0.905559  -5.652 1.63e-08 ***
## SEASONSPRING               1.993019   0.767041   2.598 0.009382 ** 
## SEASONSUMMER               3.395235   0.790524   4.295 1.76e-05 ***
## SEASONWINTER               2.345420   0.758544   3.092 0.001994 ** 
## REGIONCALI_NEVADA         -0.591899   1.751842  -0.338 0.735467    
## REGIONCOLORADO            -1.508996   0.913527  -1.652 0.098601 .  
## REGIONDESERT_SW            2.858842   1.044138   2.738 0.006193 ** 
## REGIONKANSAS             -26.011662   2.915571  -8.922  < 2e-16 ***
## REGIONMOUNTAIN             5.292040   1.064355   4.972 6.74e-07 ***
## REGIONNEWMEXICO           -5.049555   1.528053  -3.305 0.000955 ***
## REGIONNOCAL               -4.243541   1.713116  -2.477 0.013263 *  
## REGIONNORTHERN             8.757881   0.758761  11.542  < 2e-16 ***
## REGIONPRAIRIE              0.156771   1.938273   0.081 0.935538    
## REGIONSOCAL               -0.894532   1.228555  -0.728 0.466560    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.33 on 9623 degrees of freedom
## Multiple R-squared:  0.9088, Adjusted R-squared:  0.9086 
## F-statistic:  5329 on 18 and 9623 DF,  p-value: < 2.2e-16
# Looking at the competitors within the 'innovation' data frame
model <- lm(DOLLAR_SALES ~ UNIT_SALES + CALORIC_SEGMENT + PACKAGE + SEASON + REGION, data = innovation)
summary(model)
## 
## Call:
## lm(formula = DOLLAR_SALES ~ UNIT_SALES + CALORIC_SEGMENT + PACKAGE + 
##     SEASON + REGION, data = innovation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -368.48  -17.90   -2.46   12.49 1059.78 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               37.85905    1.46502  25.842  < 2e-16 ***
## UNIT_SALES                 1.46760    0.00615 238.652  < 2e-16 ***
## CALORIC_SEGMENT           -4.01836    1.03672  -3.876 0.000107 ***
## PACKAGE.5L MULTI JUG     -29.13477   18.72549  -1.556 0.119768    
## PACKAGE20SMALL MULTI JUG -37.01711    1.34853 -27.450  < 2e-16 ***
## PACKAGE26-32SMALL MLT      5.61164    1.43813   3.902 9.60e-05 ***
## SEASONSPRING              -3.10088    1.21711  -2.548 0.010858 *  
## SEASONSUMMER              -3.66627    1.25500  -2.921 0.003494 ** 
## SEASONWINTER              -6.84062    1.20219  -5.690 1.31e-08 ***
## REGIONCALI_NEVADA         -7.32499    2.77873  -2.636 0.008400 ** 
## REGIONCOLORADO             0.41421    1.44973   0.286 0.775101    
## REGIONDESERT_SW           -7.63531    1.65559  -4.612 4.04e-06 ***
## REGIONKANSAS              98.20160    4.53620  21.648  < 2e-16 ***
## REGIONMOUNTAIN           -10.17712    1.68784  -6.030 1.70e-09 ***
## REGIONNEWMEXICO           -3.08763    2.42579  -1.273 0.203109    
## REGIONNOCAL                4.15890    2.71881   1.530 0.126130    
## REGIONNORTHERN           -13.49698    1.20443 -11.206  < 2e-16 ***
## REGIONPRAIRIE              2.73801    3.07541   0.890 0.373332    
## REGIONSOCAL               -5.17125    1.94874  -2.654 0.007976 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.79 on 9623 degrees of freedom
## Multiple R-squared:  0.8958, Adjusted R-squared:  0.8956 
## F-statistic:  4597 on 18 and 9623 DF,  p-value: < 2.2e-16

Alternating unit sales and dollar sales as the target variable yields very interesting findings. The opposite signs on REGIONKANSAS in the two models highlight a unique regional effect: UNIT_SALES decrease notably, suggesting lower product volume sold in Kansas, while DOLLAR_SALES increase significantly, indicating higher revenue per unit or a preference for pricier products. Despite lower quantities, the revenue impact in Kansas is markedly higher, reflecting distinct consumer behavior or pricing strategies; market strategy there should focus not only on volume but also on the pricing and mix of products offered. Conversely, the NORTHERN coefficient presents the contrasting pattern: it significantly increases UNIT_SALES, indicating a higher volume of products sold in the Northern region, but significantly decreases DOLLAR_SALES, implying lower revenue per unit or a preference for less expensive products. In the Northern region, products are popular in quantity but not necessarily in revenue, possibly due to lower pricing or a tilt toward lower-priced items.
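One direct way to probe this volume-versus-revenue interpretation is to compare implied average price per unit by region (total dollars over total units). The sketch below uses invented numbers that mirror the KANSAS-versus-NORTHERN pattern; with the real data, the same `group_by()`/`summarise()` would be applied to the `innovation` data frame:

```r
library(dplyr)

# Invented rows mirroring the pattern above: Kansas sells fewer units at
# higher dollar values; Northern sells many units at lower dollar values.
toy <- data.frame(
  REGION       = c("KANSAS", "KANSAS", "NORTHERN", "NORTHERN"),
  UNIT_SALES   = c(10, 12, 200, 180),
  DOLLAR_SALES = c(90, 110, 400, 380)
)

price_by_region <- toy %>%
  group_by(REGION) %>%
  summarise(units     = sum(UNIT_SALES),
            dollars   = sum(DOLLAR_SALES),
            avg_price = dollars / units)

price_by_region  # Kansas' implied price per unit is several times Northern's
```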

# Shrink the df by capping unit and dollar sales to trim extreme outliers, then profile GREETINGLE on its own.

small_group <- df %>%
  filter(UNIT_SALES < 6000, DOLLAR_SALES < 12000)

#skim(small_group)
skim(df %>% filter(BRAND == "GREETINGLE"))
Data summary
Name df %>% filter(BRAND == “G…
Number of rows 48787
Number of columns 13
_______________________
Column type frequency:
character 9
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
MARKET_KEY 0 1 1 4 0 200 0
DATE 0 1 10 10 0 148 0
CATEGORY 0 1 18 18 0 1 0
MANUFACTURER 0 1 8 8 0 1 0
BRAND 0 1 10 10 0 1 0
PACKAGE 0 1 12 17 0 5 0
ITEM 0 1 60 116 0 36 0
REGION 0 1 5 11 0 11 0
SEASON 0 1 4 6 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
CALORIC_SEGMENT 0 1 0.00 0.00 0.0 0.00 0.00 0.00 0.00 ▁▁▇▁▁
UNIT_SALES 0 1 44.49 98.77 1.0 8.00 23.00 54.00 4616.00 ▇▁▁▁▁
DOLLAR_SALES 0 1 117.67 284.42 0.3 26.94 64.71 131.82 11801.56 ▇▁▁▁▁
MONTH 0 1 6.33 3.38 1.0 4.00 6.00 9.00 12.00 ▇▆▆▅▇
# Create a scatter plot with the regression line, colored by MANUFACTURER
ggplot(small_group, aes(x = UNIT_SALES, y = DOLLAR_SALES, color = MANUFACTURER)) +
  geom_point(alpha = 0.5) +  # Adjust alpha to avoid overplotting, if necessary
  geom_smooth(method = "lm", color = "black", se = FALSE) +  # Add linear regression line without confidence band for clarity
  labs(title = "Linear Model of UNIT_SALES vs. DOLLAR_SALES by MANUFACTURER",
       x = "UNIT SALES",
       y = "DOLLAR SALES") +
  theme_minimal() +
  theme(legend.position = "bottom") 
## `geom_smooth()` using formula = 'y ~ x'

# Create a new data frame with only the rows where ITEM contains "woodsy  yellow" (note the double space in the item strings)
wy_small <- df[grep("woodsy  yellow", df$ITEM, ignore.case = TRUE), ]
skim(wy_small)
Data summary
Name wy_small
Number of rows 9642
Number of columns 13
_______________________
Column type frequency:
character 9
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
MARKET_KEY 0 1 1 4 0 200 0
DATE 0 1 10 10 0 148 0
CATEGORY 0 1 18 18 0 1 0
MANUFACTURER 0 1 5 5 0 1 0
BRAND 0 1 14 21 0 2 0
PACKAGE 0 1 12 17 0 4 0
ITEM 0 1 71 85 0 5 0
REGION 0 1 5 11 0 11 0
SEASON 0 1 4 6 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
CALORIC_SEGMENT 0 1 0.50 0.50 0.00 0.0 0.00 1.00 1.00 ▇▁▁▁▇
UNIT_SALES 0 1 64.27 87.13 1.00 9.0 30.00 87.00 1259.00 ▇▁▁▁▁
DOLLAR_SALES 0 1 105.30 129.34 0.75 23.5 59.85 139.99 1909.83 ▇▁▁▁▁
MONTH 0 1 6.37 3.51 1.00 3.0 6.00 9.00 12.00 ▇▆▅▅▇

Woodsy Yellow shows average unit sales of 64.27 and average dollar sales of $105.30. Compare that to the 10% sample, where unit sales average 142.32 and dollar sales $465.11, and to GREETINGLE overall, which averages 44.49 unit sales and $117.67 dollar sales. The implied average prices per unit differ markedly across the three groups.
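Those ratios can be made explicit as implied average price per unit (mean dollar sales divided by mean unit sales), computed directly from the group means quoted above:

```r
# Implied average price per unit, from the group means quoted above
means <- data.frame(
  group   = c("WOODSY YELLOW", "10% SAMPLE", "GREETINGLE"),
  units   = c(64.27, 142.32, 44.49),
  dollars = c(105.30, 465.11, 117.67)
)
means$avg_price <- round(means$dollars / means$units, 2)
means
# Woodsy Yellow's implied price (~$1.64/unit) sits well below both the
# overall sample (~$3.27/unit) and the GREETINGLE brand (~$2.64/unit).
```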

# wy_small data frame - MANUFACTURER and CATEGORY dropped from the formula since each has only one level
model <- lm(DOLLAR_SALES ~ UNIT_SALES + PACKAGE + CALORIC_SEGMENT + SEASON + REGION, data = wy_small)
summary(model)
## 
## Call:
## lm(formula = DOLLAR_SALES ~ UNIT_SALES + PACKAGE + CALORIC_SEGMENT + 
##     SEASON + REGION, data = wy_small)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -368.48  -17.90   -2.46   12.49 1059.78 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               37.85905    1.46502  25.842  < 2e-16 ***
## UNIT_SALES                 1.46760    0.00615 238.652  < 2e-16 ***
## PACKAGE.5L MULTI JUG     -29.13477   18.72549  -1.556 0.119768    
## PACKAGE20SMALL MULTI JUG -37.01711    1.34853 -27.450  < 2e-16 ***
## PACKAGE26-32SMALL MLT      5.61164    1.43813   3.902 9.60e-05 ***
## CALORIC_SEGMENT           -4.01836    1.03672  -3.876 0.000107 ***
## SEASONSPRING              -3.10088    1.21711  -2.548 0.010858 *  
## SEASONSUMMER              -3.66627    1.25500  -2.921 0.003494 ** 
## SEASONWINTER              -6.84062    1.20219  -5.690 1.31e-08 ***
## REGIONCALI_NEVADA         -7.32499    2.77873  -2.636 0.008400 ** 
## REGIONCOLORADO             0.41421    1.44973   0.286 0.775101    
## REGIONDESERT_SW           -7.63531    1.65559  -4.612 4.04e-06 ***
## REGIONKANSAS              98.20160    4.53620  21.648  < 2e-16 ***
## REGIONMOUNTAIN           -10.17712    1.68784  -6.030 1.70e-09 ***
## REGIONNEWMEXICO           -3.08763    2.42579  -1.273 0.203109    
## REGIONNOCAL                4.15890    2.71881   1.530 0.126130    
## REGIONNORTHERN           -13.49698    1.20443 -11.206  < 2e-16 ***
## REGIONPRAIRIE              2.73801    3.07541   0.890 0.373332    
## REGIONSOCAL               -5.17125    1.94874  -2.654 0.007976 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.79 on 9623 degrees of freedom
## Multiple R-squared:  0.8958, Adjusted R-squared:  0.8956 
## F-statistic:  4597 on 18 and 9623 DF,  p-value: < 2.2e-16
# wy_small data frame - MANUFACTURER and CATEGORY dropped from the formula since each has only one level
model <- lm(UNIT_SALES ~  DOLLAR_SALES + PACKAGE + CALORIC_SEGMENT + SEASON + REGION, data = wy_small)
summary(model)
## 
## Call:
## lm(formula = UNIT_SALES ~ DOLLAR_SALES + PACKAGE + CALORIC_SEGMENT + 
##     SEASON + REGION, data = wy_small)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -581.68  -11.32   -0.20    9.87  299.96 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -19.945521   0.932885 -21.380  < 2e-16 ***
## DOLLAR_SALES               0.582900   0.002442 238.652  < 2e-16 ***
## PACKAGE.5L MULTI JUG      15.711797  11.801624   1.331 0.183114    
## PACKAGE20SMALL MULTI JUG  30.867445   0.824520  37.437  < 2e-16 ***
## PACKAGE26-32SMALL MLT     -5.117949   0.905559  -5.652 1.63e-08 ***
## CALORIC_SEGMENT            9.483530   0.646691  14.665  < 2e-16 ***
## SEASONSPRING               1.993019   0.767041   2.598 0.009382 ** 
## SEASONSUMMER               3.395235   0.790524   4.295 1.76e-05 ***
## SEASONWINTER               2.345420   0.758544   3.092 0.001994 ** 
## REGIONCALI_NEVADA         -0.591899   1.751842  -0.338 0.735467    
## REGIONCOLORADO            -1.508996   0.913527  -1.652 0.098601 .  
## REGIONDESERT_SW            2.858842   1.044138   2.738 0.006193 ** 
## REGIONKANSAS             -26.011662   2.915571  -8.922  < 2e-16 ***
## REGIONMOUNTAIN             5.292040   1.064355   4.972 6.74e-07 ***
## REGIONNEWMEXICO           -5.049555   1.528053  -3.305 0.000955 ***
## REGIONNOCAL               -4.243541   1.713116  -2.477 0.013263 *  
## REGIONNORTHERN             8.757881   0.758761  11.542  < 2e-16 ***
## REGIONPRAIRIE              0.156771   1.938273   0.081 0.935538    
## REGIONSOCAL               -0.894532   1.228555  -0.728 0.466560    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.33 on 9623 degrees of freedom
## Multiple R-squared:  0.9088, Adjusted R-squared:  0.9086 
## F-statistic:  5329 on 18 and 9623 DF,  p-value: < 2.2e-16

The simple insights of a linear regression on the data make useful points. Let's see if XGBoost can do better.

XGBoost

# Load and prepare the dataset (created separately for brevity; derived from wy_small, then one-hot encoded)

df <- read.csv("woodsy_one_hot.csv") 
df <- df %>% 
  select(-DATE, -MONTH)
# Summarize the dataset
skimr::skim(df)
Data summary
Name df
Number of rows 9642
Number of columns 30
_______________________
Column type frequency:
numeric 30
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
CALORIC_SEGMENT 0 1 0.50 0.50 0.00 0.0 0.00 1.00 1.00 ▇▁▁▁▇
UNIT_SALES 0 1 64.27 87.13 1.00 9.0 30.00 87.00 1259.00 ▇▁▁▁▁
DOLLAR_SALES 0 1 105.30 129.34 0.75 23.5 59.85 139.99 1909.83 ▇▁▁▁▁
POINT5L 0 1 0.19 0.39 0.00 0.0 0.00 0.00 1.00 ▇▁▁▁▂
X20SMALL 0 1 0.51 0.50 0.00 0.0 1.00 1.00 1.00 ▇▁▁▁▇
X26.32SMALL 0 1 0.30 0.46 0.00 0.0 0.00 1.00 1.00 ▇▁▁▁▃
X6ONE 0 1 0.18 0.39 0.00 0.0 0.00 0.00 1.00 ▇▁▁▁▂
JUG 0 1 0.70 0.46 0.00 0.0 1.00 1.00 1.00 ▃▁▁▁▇
MLT 0 1 0.30 0.46 0.00 0.0 0.00 1.00 1.00 ▇▁▁▁▃
MULTI 0 1 0.51 0.50 0.00 0.0 1.00 1.00 1.00 ▇▁▁▁▇
BRAND_STRONGLY.ENERGY.WATER 0 1 0.40 0.49 0.00 0.0 0.00 1.00 1.00 ▇▁▁▁▅
BRAND_VITAMINAL.FLOW 0 1 0.60 0.49 0.00 0.0 1.00 1.00 1.00 ▅▁▁▁▇
REGION_ARIZONA 0 1 0.29 0.45 0.00 0.0 0.00 1.00 1.00 ▇▁▁▁▃
REGION_CALI_NEVADA 0 1 0.03 0.16 0.00 0.0 0.00 0.00 1.00 ▇▁▁▁▁
REGION_COLORADO 0 1 0.13 0.33 0.00 0.0 0.00 0.00 1.00 ▇▁▁▁▁
REGION_DESERT_SW 0 1 0.09 0.28 0.00 0.0 0.00 0.00 1.00 ▇▁▁▁▁
REGION_KANSAS 0 1 0.01 0.10 0.00 0.0 0.00 0.00 1.00 ▇▁▁▁▁
REGION_MOUNTAIN 0 1 0.08 0.28 0.00 0.0 0.00 0.00 1.00 ▇▁▁▁▁
REGION_NEWMEXICO 0 1 0.04 0.19 0.00 0.0 0.00 0.00 1.00 ▇▁▁▁▁
REGION_NOCAL 0 1 0.03 0.16 0.00 0.0 0.00 0.00 1.00 ▇▁▁▁▁
REGION_NORTHERN 0 1 0.24 0.43 0.00 0.0 0.00 0.00 1.00 ▇▁▁▁▂
REGION_PRAIRIE 0 1 0.02 0.14 0.00 0.0 0.00 0.00 1.00 ▇▁▁▁▁
REGION_SOCAL 0 1 0.06 0.24 0.00 0.0 0.00 0.00 1.00 ▇▁▁▁▁
SEASON_FALL 0 1 0.23 0.42 0.00 0.0 0.00 0.00 1.00 ▇▁▁▁▂
SEASON_SPRING 0 1 0.26 0.44 0.00 0.0 0.00 1.00 1.00 ▇▁▁▁▃
SEASON_SUMMER 0 1 0.23 0.42 0.00 0.0 0.00 0.00 1.00 ▇▁▁▁▂
SEASON_WINTER 0 1 0.28 0.45 0.00 0.0 0.00 1.00 1.00 ▇▁▁▁▃
ITEM_STRONGLY.IONIC.WATER.BEVERAGE.WOODSY.YELLOW. 0 1 0.40 0.49 0.00 0.0 0.00 1.00 1.00 ▇▁▁▁▅
ITEM_TITANS.NOURISH.WATER.BEVERAGE.WOODSY.YELLOW. 0 1 0.30 0.46 0.00 0.0 0.00 1.00 1.00 ▇▁▁▁▃
ITEM_TITANS.NOURISH.WATER.NUTRIENT.ENHANCED.BVRG.WOODSY.YELLOW. 0 1 0.30 0.46 0.00 0.0 0.00 1.00 1.00 ▇▁▁▁▃
# Split the data
set.seed(123)
df_testtrn <- initial_split(df, prop = 0.8, strata = UNIT_SALES)
Train <- training(df_testtrn)
Test <- testing(df_testtrn)

# Prepare features and labels for XGBoost
train_features <- Train[, -which(names(Train) == "UNIT_SALES")]
train_labels <- Train$UNIT_SALES
test_features <- Test[, -which(names(Test) == "UNIT_SALES")]
test_labels <- Test$UNIT_SALES

# Convert data to DMatrix format
dtrain <- xgb.DMatrix(data = as.matrix(train_features), label = train_labels)
dtest <- xgb.DMatrix(data = as.matrix(test_features), label = test_labels)
# Define XGBoost parameters
set.seed(123)
params <- list(
  booster = "gbtree",
  objective = "reg:squarederror",
  eval_metric = "rmse",
  eta = 0.05,
  max_depth = 4,
  min_child_weight = 3,
  subsample = 0.7,
  colsample_bytree = 0.6,
  lambda = 1,
  alpha = 1
)
# Perform cross-validation to find the optimal number of boosting rounds
cv_results <- xgb.cv(
  params = params,
  data = dtrain,  
  nfold = 5,
  nrounds = 500,
  early_stopping_rounds = 10,
  metrics = "rmse",
  seed = 123
)
## [1]  train-rmse:104.438734+1.862060  test-rmse:104.277497+5.560981 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [2]    train-rmse:100.481420+1.828578  test-rmse:100.300344+5.635670 
## [3]    train-rmse:97.437686+1.724683   test-rmse:97.328994+5.723550 
## ...
## [100]  train-rmse:16.344944+0.733505   test-rmse:18.413179+2.907366 
## ...
## [200]  train-rmse:13.801977+0.467224   test-rmse:16.622494+2.239977 
## ...
## [300]  train-rmse:12.953615+0.461868   test-rmse:16.228533+2.060168 
## ...
## [351]  train-rmse:12.642262+0.437834   test-rmse:16.088970+2.042331 
## ...
## [361]  train-rmse:12.589956+0.448506   test-rmse:16.100011+2.037986 
## Stopping. Best iteration:
## [351]    train-rmse:12.642262+0.437834   test-rmse:16.088970+2.042331
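The per-iteration log above comes from `xgb.cv` with early stopping. For reference, a sketch of the call that would produce it; the `nfold`, `nrounds` cap, and `early_stopping_rounds` values shown here are assumptions, while `params` and `dtrain` are defined earlier in the document:

```r
library(xgboost)

# Sketch of the cross-validation call behind the log above (settings assumed)
cv_results <- xgb.cv(
  params = params,              # same hyperparameters as the final model
  data = dtrain,
  nrounds = 1000,               # upper bound; early stopping cuts training off
  nfold = 5,                    # assumed fold count
  early_stopping_rounds = 10,   # stop once test RMSE fails to improve
  verbose = TRUE
)
```

With early stopping, `cv_results$best_iteration` holds the round with the lowest mean test RMSE (iteration 351 in the log above), which is then reused as `best_nrounds` when fitting the final model.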
best_nrounds <- cv_results$best_iteration
# Train the final model using the best number of rounds found
model_xgb <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = best_nrounds
)
# Make predictions and evaluate the model
train_pred <- predict(model_xgb, dtrain)
test_pred <- predict(model_xgb, dtest)
train_rmse <- sqrt(mean((train_labels - train_pred)^2))
test_rmse <- sqrt(mean((test_labels - test_pred)^2))
# Calculate R-squared for the training set
sst_train <- sum((train_labels - mean(train_labels)) ^ 2)
ssr_train <- sum((train_labels - train_pred) ^ 2)
r_squared_train <- 1 - (ssr_train / sst_train)

# Calculate R-squared for the test set
sst_test <- sum((test_labels - mean(test_labels)) ^ 2)
ssr_test <- sum((test_labels - test_pred) ^ 2)
r_squared_test <- 1 - (ssr_test / sst_test)

train_mape <- mean(abs((train_labels - train_pred) / train_labels)) * 100
test_mape <- mean(abs((test_labels - test_pred) / test_labels)) * 100
train_mae <- mean(abs(train_labels - train_pred))
test_mae <- mean(abs(test_labels - test_pred))
cat("Model Performance Metrics:\n",
    "--------------------------\n",
    "Training RMSE: ", train_rmse, "\n",
    "Test RMSE: ", test_rmse, "\n",
    "Training R-squared: ", r_squared_train, "\n",
    "Test R-squared: ", r_squared_test, "\n",
    "Training MAE: ", train_mae, "\n",
    "Test MAE: ", test_mae, "\n",
    "Training MAPE: ", train_mape, "%\n",
    "Test MAPE: ", test_mape, "%\n", sep="")
## Model Performance Metrics:
## --------------------------
## Training RMSE: 12.94671
## Test RMSE: 14.37189
## Training R-squared: 0.9780564
## Test R-squared: 0.9720868
## Training MAE: 6.999468
## Test MAE: 7.641593
## Training MAPE: 25.73508%
## Test MAPE: 28.05712%
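The same four metrics are computed again for Model #2 below, so a small helper keeps the calculations in one place. This is a hypothetical refactoring sketch, not part of the original analysis; `reg_metrics` and the toy vectors are illustrative names:

```r
# Hypothetical helper: computes the metrics reported above
# (RMSE, R-squared, MAE, MAPE) from actual and predicted vectors.
reg_metrics <- function(actual, pred) {
  resid <- actual - pred
  c(
    RMSE = sqrt(mean(resid^2)),
    R2   = 1 - sum(resid^2) / sum((actual - mean(actual))^2),
    MAE  = mean(abs(resid)),
    MAPE = mean(abs(resid / actual)) * 100   # undefined if any actual == 0
  )
}

# Tiny worked example
m <- reg_metrics(actual = c(10, 20, 30), pred = c(12, 18, 33))
round(m, 3)
# RMSE 2.380, R2 0.915, MAE 2.333, MAPE 13.333
```

Applied to `train_labels`/`train_pred` and `test_labels`/`test_pred`, this reproduces the numbers printed above.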
residuals_train <- train_labels - train_pred
residuals_test <- test_labels - test_pred

residuals_data <- data.frame(
  Residuals = c(residuals_train, residuals_test),
  Dataset = c(rep('Training', length(residuals_train)), rep('Test', length(residuals_test)))
)

# Now plotting residuals with corrected data
ggplot(residuals_data, aes(x = Residuals, fill = Dataset)) +
  geom_histogram(binwidth = 1, position = 'identity', alpha = 0.6) +
  facet_wrap(~ Dataset) +
  ggtitle('Residuals Distribution')

# Assuming train_labels, test_labels, train_pred, and test_pred are correctly defined

# Adjusted Actual vs. Predicted Data Preparation
actual_pred_data <- data.frame(
  Actual = c(train_labels, test_labels),
  Predicted = c(train_pred, test_pred),
  Dataset = c(rep('Training', length(train_labels)), rep('Test', length(test_labels)))
)

# Plotting Actual vs. Predicted Values
ggplot(actual_pred_data, aes(x = Actual, y = Predicted, colour = Dataset)) +
  geom_point(alpha = 0.6) +
  geom_abline(intercept = 0, slope = 1, linetype = 'dashed', color = 'red') +
  xlab('Actual Values') +
  ylab('Predicted Values') +
  scale_colour_manual(values = c('Training' = 'blue', 'Test' = 'red')) +
  ggtitle('Actual vs. Predicted Values')

# Calculate feature importance
importance_matrix <- xgb.importance(feature_names = colnames(train_features), model = model_xgb)

# View the feature importance scores
print(importance_matrix)
##                                               Feature         Gain       Cover   Frequency
##  1:                                      DOLLAR_SALES 7.160014e-01 0.416008824 0.277308955
##  2: ITEM_TITANS.NOURISH.WATER.BEVERAGE.WOODSY.YELLOW. 8.460387e-02 0.072248291 0.040318800
##  3:                                          X20SMALL 5.253331e-02 0.039518591 0.065869667
##  4:                                     REGION_KANSAS 2.588577e-02 0.063563389 0.032817628
##  5:                                       X26.32SMALL 2.140248e-02 0.007447924 0.015471167
##  6:                                   CALORIC_SEGMENT 1.875034e-02 0.026356971 0.053211439
##  7:                                   REGION_NORTHERN 1.839264e-02 0.026885309 0.034224098
##  8:                                           POINT5L 1.135987e-02 0.022911266 0.052039381
##  9: ITEM_STRONGLY.IONIC.WATER.BEVERAGE.WOODSY.YELLOW. 1.099146e-02 0.003510788 0.003750586
## 10:                                             MULTI 9.433681e-03 0.021345434 0.027660572
## 11:                                             X6ONE 5.132903e-03 0.010413045 0.017112049
## 12:                       BRAND_STRONGLY.ENERGY.WATER 4.417200e-03 0.029111763 0.033755274
## 13:                                     SEASON_SUMMER 3.310559e-03 0.018095639 0.047819972
## 14:                                  REGION_NEWMEXICO 3.074743e-03 0.034287331 0.019924988
## 15:                                   REGION_COLORADO 2.858571e-03 0.013095214 0.024613221
## 16:                                     SEASON_WINTER 2.382166e-03 0.022375256 0.041725270
## 17:                                    REGION_ARIZONA 1.989458e-03 0.030481949 0.033286451
## 18:                              BRAND_VITAMINAL.FLOW 1.603145e-03 0.016418292 0.014767932
## 19:                                   REGION_MOUNTAIN 1.128815e-03 0.018426347 0.020862635
## 20:                                      REGION_SOCAL 9.943884e-04 0.022579104 0.015705579
## 21:                                    REGION_PRAIRIE 8.475404e-04 0.026447717 0.018049695
## 22:                                               JUG 7.345926e-04 0.002186636 0.003281763
## 23:                                  REGION_DESERT_SW 7.087854e-04 0.009764727 0.015939991
## 24:                                       SEASON_FALL 7.024262e-04 0.009622391 0.032583216
## 25:                                     SEASON_SPRING 4.200505e-04 0.008589790 0.040084388
## 26:                                      REGION_NOCAL 1.803521e-04 0.017919968 0.009376465
## 27:                                REGION_CALI_NEVADA 1.078172e-04 0.009019048 0.005860291
## 28:                                               MLT 5.166534e-05 0.001368996 0.002578528
# Plot the feature importance
xgb.plot.importance(importance_matrix = importance_matrix)

# Compute partial dependence data for 'DOLLAR_SALES' and 'CALORIC_SEGMENT'
pd <- partial(model_xgb, pred.var = c("DOLLAR_SALES", "CALORIC_SEGMENT"), train = train_features, grid.resolution = 20)

# Default PDP (note: 'plot = TRUE' is an argument of partial(), not plotPartial())
pdp1 <- plotPartial(pd)

# Add contour lines and use a different color palette
rwb <- colorRampPalette(c("red", "white", "blue"))
pdp2 <- plotPartial(pd, contour = TRUE, col.regions = rwb)

# 3-D surface
pdp3 <- plotPartial(pd, levelplot = FALSE, zlab = "Predicted Outcome", drape = TRUE, colorkey = TRUE, screen = list(z = -20, x = -60))

# Combine plots into one window
grid.arrange(pdp1, pdp2, pdp3, ncol = 3)

XGBoost Model #2

Model Without the DOLLAR_SALES Variable

# Assuming 'df' is your complete dataframe and 'UNIT_SALES' is your target variable
df2 <- df
# Remove DOLLAR_SALES from the features
df2$DOLLAR_SALES <- NULL

# Split the updated data into training and testing sets (assuming you're using a similar approach as before)
set.seed(123)
df2_testtrn <- initial_split(df2, prop = 0.8, strata = UNIT_SALES)
Train <- training(df2_testtrn)
Test <- testing(df2_testtrn)

# Prepare features and labels for XGBoost, excluding DOLLAR_SALES
train_features <- Train[, -which(names(Train) == "UNIT_SALES")]
train_labels <- Train$UNIT_SALES
test_features <- Test[, -which(names(Test) == "UNIT_SALES")]
test_labels <- Test$UNIT_SALES

# Convert data to DMatrix format for XGBoost
dtrain <- xgb.DMatrix(data = as.matrix(train_features), label = train_labels)
dtest <- xgb.DMatrix(data = as.matrix(test_features), label = test_labels)
# Assuming 'params' and 'best_nrounds' are defined as before

# Train the final model without DOLLAR_SALES
model_xgb_no_dollar_sales <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = best_nrounds
)
# Make predictions and evaluate the model
train_pred <- predict(model_xgb_no_dollar_sales, dtrain)
test_pred <- predict(model_xgb_no_dollar_sales, dtest)
train_rmse <- sqrt(mean((train_labels - train_pred)^2))
test_rmse <- sqrt(mean((test_labels - test_pred)^2))
# Calculate R-squared for the training set
sst_train <- sum((train_labels - mean(train_labels)) ^ 2)
ssr_train <- sum((train_labels - train_pred) ^ 2)
r_squared_train <- 1 - (ssr_train / sst_train)

# Calculate R-squared for the test set
sst_test <- sum((test_labels - mean(test_labels)) ^ 2)
ssr_test <- sum((test_labels - test_pred) ^ 2)
r_squared_test <- 1 - (ssr_test / sst_test)

train_mape <- mean(abs((train_labels - train_pred) / train_labels)) * 100
test_mape <- mean(abs((test_labels - test_pred) / test_labels)) * 100
train_mae <- mean(abs(train_labels - train_pred))
test_mae <- mean(abs(test_labels - test_pred))
cat("Model Performance Metrics:\n",
    "--------------------------\n",
    "Training RMSE: ", train_rmse, "\n",
    "Test RMSE: ", test_rmse, "\n",
    "Training R-squared: ", r_squared_train, "\n",
    "Test R-squared: ", r_squared_test, "\n",
    "Training MAE: ", train_mae, "\n",
    "Test MAE: ", test_mae, "\n",
    "Training MAPE: ", train_mape, "%\n",
    "Test MAPE: ", test_mape, "%\n", sep="")
## Model Performance Metrics:
## --------------------------
## Training RMSE: 57.75024
## Test RMSE: 60.45803
## Training R-squared: 0.5633861
## Test R-squared: 0.5060416
## Training MAE: 32.79925
## Test MAE: 34.51334
## Training MAPE: 196.601%
## Test MAPE: 206.9347%

Removing DOLLAR_SALES costs the model a significant amount of accuracy (test R-squared drops to roughly 0.51), and the very large MAPE values are inflated by rows with small unit sales in the denominator. Still, we can see that the remaining features together carry meaningful predictive power.
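The RMSE, R-squared, MAE, and MAPE calculations are repeated verbatim for each model above; a small helper (hypothetical name `reg_metrics`, not part of the original analysis) is one way to keep them consistent across models:

```r
# Sketch of a reusable metrics helper, assuming numeric vectors of the same length.
reg_metrics <- function(actual, pred) {
  resid <- actual - pred
  list(
    rmse = sqrt(mean(resid^2)),
    r2   = 1 - sum(resid^2) / sum((actual - mean(actual))^2),
    mae  = mean(abs(resid)),
    mape = mean(abs(resid / actual)) * 100  # inflated when actual is near zero
  )
}

# Usage with the objects from the chunk above:
# reg_metrics(test_labels, test_pred)
```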

# Calculate feature importance
importance_matrix2 <- xgb.importance(feature_names = colnames(train_features), model = model_xgb_no_dollar_sales)

# View the feature importance scores
print(importance_matrix2)
##                                                             Feature
##  1:               ITEM_TITANS.NOURISH.WATER.BEVERAGE.WOODSY.YELLOW.
##  2:                                                        X20SMALL
##  3:                                                   REGION_KANSAS
##  4:                                                 CALORIC_SEGMENT
##  5:                                                 REGION_NORTHERN
##  6:                                     BRAND_STRONGLY.ENERGY.WATER
##  7:                                                     X26.32SMALL
##  8:                                                         POINT5L
##  9:                                                   SEASON_SUMMER
## 10:                                                   SEASON_WINTER
## 11:                                                REGION_NEWMEXICO
## 12:                                                           X6ONE
## 13:                                                     SEASON_FALL
## 14:                                                   SEASON_SPRING
## 15:                                            BRAND_VITAMINAL.FLOW
## 16:                                                  REGION_ARIZONA
## 17:                                                           MULTI
## 18:                                                  REGION_PRAIRIE
## 19:                                                    REGION_SOCAL
## 20:                                                 REGION_MOUNTAIN
## 21:                                                 REGION_COLORADO
## 22:                                                REGION_DESERT_SW
## 23:                                              REGION_CALI_NEVADA
## 24:               ITEM_STRONGLY.IONIC.WATER.BEVERAGE.WOODSY.YELLOW.
## 25:                                                    REGION_NOCAL
## 26:                                                             JUG
## 27:                                                             MLT
## 28: ITEM_TITANS.NOURISH.WATER.NUTRIENT.ENHANCED.BVRG.WOODSY.YELLOW.
##                                                             Feature
##             Gain        Cover    Frequency
##  1: 2.662779e-01 0.0764586652 0.0533990993
##  2: 1.769269e-01 0.0516776858 0.0675530774
##  3: 1.595826e-01 0.1127620588 0.0467510187
##  4: 9.201953e-02 0.0578886801 0.0812781471
##  5: 5.986639e-02 0.0426304578 0.0360283080
##  6: 3.487443e-02 0.0254071213 0.0334548574
##  7: 3.298050e-02 0.0169064936 0.0250911430
##  8: 3.059437e-02 0.0206221847 0.0553291872
##  9: 2.233120e-02 0.0515547338 0.0782757881
## 10: 1.931196e-02 0.0491548540 0.0739867038
## 11: 1.851870e-02 0.0349538334 0.0223032383
## 12: 1.056117e-02 0.0090485515 0.0197297877
## 13: 9.596539e-03 0.0357442202 0.0613339052
## 14: 9.429947e-03 0.0354170541 0.0662663521
## 15: 8.704337e-03 0.0083981897 0.0092215312
## 16: 7.556334e-03 0.0461578166 0.0362427622
## 17: 7.540057e-03 0.0256035268 0.0242333262
## 18: 6.702594e-03 0.0475177000 0.0293802273
## 19: 6.415693e-03 0.0355916221 0.0272356852
## 20: 4.606337e-03 0.0485284474 0.0340982200
## 21: 4.017641e-03 0.0366521326 0.0257345057
## 22: 3.187877e-03 0.0452904029 0.0283079563
## 23: 2.724455e-03 0.0349067173 0.0184430624
## 24: 2.190476e-03 0.0077097116 0.0087926228
## 25: 1.587601e-03 0.0348948059 0.0233755093
## 26: 1.013986e-03 0.0051364549 0.0092215312
## 27: 8.680470e-04 0.0031889431 0.0047179927
## 28: 1.251844e-05 0.0001969349 0.0002144542
##             Gain        Cover    Frequency
xgb.plot.importance(importance_matrix = importance_matrix2)

if (!requireNamespace("pdp", quietly = TRUE)) install.packages("pdp")
if (!requireNamespace("xgboost", quietly = TRUE)) install.packages("xgboost")
library(pdp)
library(xgboost)
pdp::partial(model_xgb_no_dollar_sales, pred.var = "REGION_KANSAS", train = train_features)
##   REGION_KANSAS      yhat
## 1             0  62.04574
## 2             1 224.83023
# The partial dependence shows a strong positive effect of REGION_KANSAS on predicted UNIT_SALES
pdp::partial(model_xgb_no_dollar_sales, pred.var = "REGION_NORTHERN", train = train_features)
##   REGION_NORTHERN     yhat
## 1               0 59.34926
## 2               1 77.09184
pd <- partial(model_xgb_no_dollar_sales, pred.var = c("REGION_KANSAS", "REGION_NORTHERN"), train = train_features, grid.resolution = 20)

# Default PDP
pdp1 <- plotPartial(pd, plot = TRUE)

# Add contour lines and use a different color palette
rwb <- colorRampPalette(c("red", "white", "blue"))
pdp2 <- plotPartial(pd, contour = TRUE, col.regions = rwb)

# 3-D surface
pdp3 <- plotPartial(pd, levelplot = FALSE, zlab = "Predicted Outcome", drape = TRUE, colorkey = TRUE, screen = list(z = -20, x = -60))

# Combine plots into one window
grid.arrange(pdp1, pdp2, pdp3, ncol = 3)

Innovation Model 2 - Cassava

Item Description: Diet Energy Moonlit Casava 2L Multi Jug
Caloric Segment: Diet
Market Category: Energy
Manufacturer: Swire-CC
Brand: Diet Moonlit
Package Type: 2L Multi Jug
Flavor: ‘Cassava’

Swire plans to release this product for 6 months. What will the forecasted demand be, in weeks, for this product?

Linear Regression Review

#summary(df) Same output as shown
# create a table of total values by brand
brand_summary <- df %>%
  group_by(BRAND) %>%
  summarise(
    total_units_sold = sum(UNIT_SALES),
    total_revenue = sum(DOLLAR_SALES),
    avg_price = total_revenue / total_units_sold,
    total_days_sold = n() # Count the number of rows for each brand
  ) %>%
  arrange(desc(total_revenue)) %>%  # Order by revenue in descending order
  mutate(rank = row_number()) 

summary(brand_summary)
##     BRAND           total_units_sold   total_revenue         avg_price      
##  Length:288         Min.   :       1   Min.   :        1   Min.   : 0.5315  
##  Class :character   1st Qu.:    2310   1st Qu.:     7563   1st Qu.: 2.0861  
##  Mode  :character   Median :   94691   Median :   266075   Median : 3.0291  
##                     Mean   : 1473003   Mean   :  4989427   Mean   : 3.2661  
##                     3rd Qu.:  651385   3rd Qu.:  2161764   3rd Qu.: 3.7252  
##                     Max.   :40414038   Max.   :159387186   Max.   :42.9378  
##  total_days_sold         rank       
##  Min.   :     1.0   Min.   :  1.00  
##  1st Qu.:   121.8   1st Qu.: 72.75  
##  Median :  1988.0   Median :144.50  
##  Mean   :  8493.5   Mean   :144.50  
##  3rd Qu.:  8075.8   3rd Qu.:216.25  
##  Max.   :124603.0   Max.   :288.00
print(brand_summary[brand_summary$BRAND == "DIET MOONLIT", ])
## # A tibble: 1 × 6
##   BRAND        total_units_sold total_revenue avg_price total_days_sold  rank
##   <chr>                   <dbl>         <dbl>     <dbl>           <int> <int>
## 1 DIET MOONLIT          749702.      2623607.      3.50            7617    69

Diet Moonlit is a rising star, ranking 69th of 288 brands in total revenue, with an average price of $3.50, slightly above the overall mean of $3.27.

# Filter the dataframe for only 'DIET MOONLIT'
filtered_df <- df %>% 
  filter(BRAND == "DIET MOONLIT")

# Create the plot
ggplot(filtered_df, aes(x = UNIT_SALES, y = DOLLAR_SALES)) +
  geom_point(color = "red", alpha = 1) +  # Bright red points with full opacity
  geom_smooth(method = "lm", color = "black", se = FALSE) +  # Add linear regression line without confidence band
  labs(title = "Linear Model of UNIT_SALES vs. DOLLAR_SALES for DIET MOONLIT",
       x = "UNIT SALES",
       y = "DOLLAR SALES") +
  theme_minimal() +
  theme(legend.position = "none")  
## `geom_smooth()` using formula = 'y ~ x'

DIET MOONLIT has a tight cluster below 1,000 unit sales and $2,500 in revenue, but there are some remarkable high-fliers nearing $20,000 and just over 3,000 units.

filtered_df %>%
  mutate(DATE = as.Date(DATE)) %>%
  mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
  group_by(WEEK) %>%
  summarise(total_sales = sum(UNIT_SALES)) %>%
  ggplot(aes(x = WEEK, y = total_sales)) +
  geom_line(color = "black") +  # Blue line connecting points
  labs(title = "Total Sales by Week of the Year",
       x = "Week of the Year",
       y = "Total Unit Sales") +
  theme_minimal()

> DIET MOONLIT shows many peaks and valleys in sales by week.

library(zoo)
# Calculate total sales for each rolling group of 21 consecutive weeks (~6 months)
sales_by_group <- filtered_df %>%
  mutate(DATE = as.Date(DATE)) %>%
  mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
  group_by(WEEK) %>%
  summarise(total_sales = sum(UNIT_SALES)) %>%
  mutate(sales_in_group = rollsum(total_sales, 21, align = "left", fill = NA)) %>%
  mutate(week_label = paste0("Week ", WEEK + 1, " to Week ", WEEK + 21)) %>%
  arrange(WEEK) %>%  # Order by WEEK
  filter(!is.na(sales_in_group))  # Remove rows with sales_in_group = NA

# Plot the bar chart
sales_by_group$week_label <- factor(sales_by_group$week_label, levels = sales_by_group$week_label[order(sales_by_group$WEEK)])
ggplot(sales_by_group, aes(x = factor(week_label), y = sales_in_group)) +
  geom_bar(stat = "identity", fill = "black") +
  labs(title = "Total Sales for Each 6-month Grouping",
       x = "Weeks (Starting from Week 1)",
       y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

> DIET MOONLIT has its best 6-month runs in weeks 7–27 historically.

#find the best 21 weeks for Casava sales
# Calculate total sales for each group of 21 consecutive weeks
sales_by_casava <- df %>%
  filter(str_detect(ITEM, "CASAVA")) %>%
  mutate(DATE = as.Date(DATE)) %>%
  mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
  group_by(WEEK) %>%
  summarise(total_sales = sum(UNIT_SALES)) %>%
  mutate(sales_in_group = rollsum(total_sales, 21, align = "left", fill = NA)) %>%
  mutate(week_label = paste0("Week ", WEEK + 1, " to Week ", WEEK + 21)) %>%
  arrange(WEEK) %>%  # Order by WEEK
  filter(!is.na(sales_in_group))  # Remove rows with sales_in_group = NA

# Plot the bar chart
sales_by_casava$week_label <- factor(sales_by_casava$week_label, levels = sales_by_casava$week_label[order(sales_by_casava$WEEK)])
ggplot(sales_by_casava, aes(x = factor(week_label), y = sales_in_group)) +
  geom_bar(stat = "identity", fill = "black") +
  labs(title = "Total Sales for Each 21-Week Grouping",
       x = "Weeks (Starting from Week 1)",
       y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

> Casava sales are best in the 21 weeks from week 14 to 34.

#find the best 13 weeks for casava, energy, diet
# Calculate total sales for each rolling group of 13 consecutive weeks

sales_by_innovation <- df %>%
  filter(CATEGORY == "ENERGY",
         str_detect(ITEM, "CASAVA")) %>%
  mutate(DATE = as.Date(DATE)) %>%
  mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
  group_by(WEEK) %>%
  summarise(total_sales = sum(UNIT_SALES)) %>%
  mutate(sales_in_group = rollsum(total_sales, 13, align = "left", fill = NA)) %>%
  mutate(week_label = paste0("Week ", WEEK + 1, " to Week ", WEEK + 13)) %>%
  arrange(WEEK) %>%  # Order by WEEK
  filter(!is.na(sales_in_group))  # Remove rows with sales_in_group = NA

# Plot the bar chart
sales_by_innovation$week_label <- factor(sales_by_innovation$week_label, levels = sales_by_innovation$week_label[order(sales_by_innovation$WEEK)])
ggplot(sales_by_innovation, aes(x = factor(week_label), y = sales_in_group)) +
  geom_bar(stat = "identity", fill = "black") +
  labs(title = "Total Sales for Each 13-Week Grouping",
       x = "Weeks (Starting from Week 1)",
       y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

# Make a new smaller "innovation" data frame
#create innovation based on Energy, Casava
innovation<- df %>% 
  filter(CATEGORY == "ENERGY",
         str_detect(ITEM, "CASAVA"))


#unique PACKAGE string from innovation
print(unique(innovation$PACKAGE))
## [1] "16SMALL MULTI CUP" "16SMALL 24ONE CUP" "12SMALL MULTI CUP"
library(dplyr)
library(lubridate)

innovation <- innovation %>%
  mutate(
    MONTH = month(ymd(DATE)),  # Extract month using lubridate's ymd function
    MONTH = as.factor(MONTH)   # Convert the extracted month into a factor
  )

str(innovation)
## 'data.frame':    5069 obs. of  13 variables:
##  $ MARKET_KEY     : chr  "685" "1811" "6" "212" ...
##  $ DATE           : chr  "2022-04-16" "2021-10-02" "2023-10-07" "2023-07-15" ...
##  $ CALORIC_SEGMENT: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ CATEGORY       : chr  "ENERGY" "ENERGY" "ENERGY" "ENERGY" ...
##  $ UNIT_SALES     : num  25 18 18 127 2 3 50 9 28 12 ...
##  $ DOLLAR_SALES   : num  52.58 30.02 29.5 314.43 3.34 ...
##  $ MANUFACTURER   : chr  "JOLLYS" "JOLLYS" "JOLLYS" "PONYS" ...
##  $ BRAND          : chr  "SUPER-DUPER JUICED" "SUPER-DUPER JUICED" "SUPER-DUPER RECOVERY" "MYTHICAL BEVERAGE" ...
##  $ PACKAGE        : chr  "16SMALL MULTI CUP" "16SMALL MULTI CUP" "16SMALL MULTI CUP" "16SMALL MULTI CUP" ...
##  $ ITEM           : chr  "SUPER-DUPER JUICED ENERGY DRINK CASAVA  SUNSET  GUAVA CUP 16 LIQUID SMALL" "SUPER-DUPER JUICED ENERGY DRINK CASAVA  SUNSET  GUAVA CUP 16 LIQUID SMALL" "SUPER-DUPER RECOVERY ENERGY DRINK CASAVA  JACK  CUP 16 LIQUID SMALL" "MYTHICAL BEVERAGE RESERVE ENERGY DRINK WHITE CASAVA  CUP 16 LIQUID SMALL" ...
##  $ REGION         : chr  "COLORADO" "KANSAS" "NORTHERN" "DESERT_SW" ...
##  $ MONTH          : Factor w/ 12 levels "1","2","3","4",..: 4 10 10 7 8 3 11 4 4 9 ...
##  $ SEASON         : chr  "SPRING" "FALL" "FALL" "SUMMER" ...
print(unique(innovation$ITEM))
## [1] "SUPER-DUPER JUICED ENERGY DRINK CASAVA  SUNSET  GUAVA CUP 16 LIQUID SMALL"              
## [2] "SUPER-DUPER RECOVERY ENERGY DRINK CASAVA  JACK  CUP 16 LIQUID SMALL"                    
## [3] "MYTHICAL BEVERAGE RESERVE ENERGY DRINK WHITE CASAVA  CUP 16 LIQUID SMALL"               
## [4] "SUPER-DUPER PITAYA ED ENERGY DRINK CASAVA  NO ARTIFICIAL SWEETENERS CUP 16 LIQUID SMALL"
## [5] "MYTHICAL BEVERAGE RESERVE ENERGY DRINK WHITE CASAVA  CUP 16 LIQUID SMALL X24"           
## [6] "JUMPIN-FISH ENERGY DRINK CASAVA  JACK  CUP 12 LIQUID SMALL"
# Count the number of unique PACKAGE column of our sample
table(innovation$PACKAGE)
## 
## 12SMALL MULTI CUP 16SMALL 24ONE CUP 16SMALL MULTI CUP 
##                 3                 3              5063
# Fit a linear model on the 'innovation' data frame
model <- lm(DOLLAR_SALES ~ UNIT_SALES + CALORIC_SEGMENT + PACKAGE + SEASON + REGION, data = innovation)
summary(model)
## 
## Call:
## lm(formula = DOLLAR_SALES ~ UNIT_SALES + CALORIC_SEGMENT + PACKAGE + 
##     SEASON + REGION, data = innovation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -675.01  -11.00    3.49   19.22  881.56 
## 
## Coefficients: (1 not defined because of singularities)
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                4.963688  29.351687   0.169 0.865716    
## UNIT_SALES                 2.335631   0.004517 517.110  < 2e-16 ***
## CALORIC_SEGMENT                  NA         NA      NA       NA    
## PACKAGE16SMALL 24ONE CUP   9.015525  41.412865   0.218 0.827673    
## PACKAGE16SMALL MULTI CUP -10.472066  29.301506  -0.357 0.720815    
## SEASONSPRING              -0.940580   1.986640  -0.473 0.635911    
## SEASONSUMMER              -9.386618   1.981350  -4.737 2.22e-06 ***
## SEASONWINTER               0.356214   1.995442   0.179 0.858327    
## REGIONCALI_NEVADA          3.969486   4.369068   0.909 0.363635    
## REGIONCOLORADO             4.995027   2.599420   1.922 0.054713 .  
## REGIONDESERT_SW           -1.735148   3.109124  -0.558 0.576813    
## REGIONKANSAS             -31.939006   6.342170  -5.036 4.92e-07 ***
## REGIONMOUNTAIN           -22.384831   2.688393  -8.326  < 2e-16 ***
## REGIONNEWMEXICO           -2.944923   3.963758  -0.743 0.457539    
## REGIONNOCAL               11.129613   3.993008   2.787 0.005335 ** 
## REGIONNORTHERN           -11.480915   2.071048  -5.544 3.11e-08 ***
## REGIONPRAIRIE             -3.761095   4.735497  -0.794 0.427096    
## REGIONSOCAL               11.447384   2.943193   3.889 0.000102 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.69 on 5052 degrees of freedom
## Multiple R-squared:  0.9836, Adjusted R-squared:  0.9836 
## F-statistic: 1.895e+04 on 16 and 5052 DF,  p-value: < 2.2e-16

Cassava and Energy together do quite well, with an R-squared of 0.98. (It is not possible to also filter on Diet — the CALORIC_SEGMENT coefficient is undefined because every row shares one segment — but we expect that folks who like regular Cassava Energy will also like Diet.) Summer is statistically significant but negatively correlated with sales, while SOCAL and NOCAL are significant in the positive direction.
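This fitted model can also be used to score a hypothetical product row. A minimal sketch, with every value an assumption: the planned 2L Multi Jug package never appears in `innovation`, so we approximate with an observed level, and `predict()` will warn about the rank-deficient fit (the NA CALORIC_SEGMENT coefficient):

```r
# Sketch only: a hypothetical new-product row for the Cassava innovation model.
new_row <- data.frame(
  UNIT_SALES      = 100,                 # assumed weekly units
  CALORIC_SEGMENT = 1,                   # matches the only observed segment
  PACKAGE         = "16SMALL MULTI CUP", # closest observed package level
  SEASON          = "SUMMER",
  REGION          = "SOCAL"
)
predict(model, newdata = new_row, interval = "prediction")
```

Because the linear model predicts DOLLAR_SALES from UNIT_SALES, this gives a revenue estimate for an assumed demand level rather than a demand forecast itself.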

library(dplyr)

small_group <- df %>%
  filter(UNIT_SALES < 7000, DOLLAR_SALES < 20000)

skim(small_group)
Data summary
Name small_group
Number of rows 2440830
Number of columns 13
_______________________
Column type frequency:
character 9
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
MARKET_KEY 0 1 1 4 0 200 0
DATE 0 1 10 10 0 152 0
CATEGORY 0 1 3 18 0 5 0
MANUFACTURER 0 1 5 8 0 8 0
BRAND 0 1 4 56 0 288 0
PACKAGE 0 1 11 26 0 95 0
ITEM 0 1 26 142 0 2999 0
REGION 0 1 5 11 0 11 0
SEASON 0 1 4 6 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
CALORIC_SEGMENT 0 1 0.50 0.50 0.00 0.00 1.00 1.0 1.00 ▇▁▁▁▇
UNIT_SALES 0 1 149.70 381.66 0.04 11.00 40.00 125.0 6999.00 ▇▁▁▁▁
DOLLAR_SALES 0 1 496.70 1224.33 0.01 36.31 134.48 423.7 19995.05 ▇▁▁▁▁
MONTH 0 1 6.28 3.43 1.00 3.00 6.00 9.0 12.00 ▇▆▆▅▇
skim(df %>% filter(BRAND == "DIET MOONLIT"))
Data summary
Name df %>% filter(BRAND == “D…
Number of rows 7617
Number of columns 13
_______________________
Column type frequency:
character 9
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
MARKET_KEY 0 1 1 4 0 200 0
DATE 0 1 10 10 0 147 0
CATEGORY 0 1 3 3 0 1 0
MANUFACTURER 0 1 8 8 0 1 0
BRAND 0 1 12 12 0 1 0
PACKAGE 0 1 12 17 0 5 0
ITEM 0 1 50 63 0 5 0
REGION 0 1 5 11 0 11 0
SEASON 0 1 4 6 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
CALORIC_SEGMENT 0 1 0.00 0.00 0.00 0.0 0.00 0.00 0.00 ▁▁▇▁▁
UNIT_SALES 0 1 98.42 292.11 1.00 23.0 51.00 92.00 6216.00 ▇▁▁▁▁
DOLLAR_SALES 0 1 344.44 1041.41 0.47 56.3 133.66 331.84 18382.96 ▇▁▁▁▁
MONTH 0 1 6.30 3.39 1.00 3.0 6.00 9.00 12.00 ▇▆▆▅▇

Our small df has higher mean unit sales and dollar sales (149 units and $496) compared to the DIET MOONLIT subset of the full df (98 units and $344).

# Create a scatter plot with the regression line, colored by MANUFACTURER
ggplot(small_group, aes(x = UNIT_SALES, y = DOLLAR_SALES, color = MANUFACTURER)) +
  geom_point(alpha = 0.5) +  # Adjust alpha to avoid overplotting, if necessary
  geom_smooth(method = "lm", color = "black", se = FALSE) +  # Add linear regression line without confidence band for clarity
  labs(title = "Linear Model of UNIT_SALES vs. DOLLAR_SALES by MANUFACTURER",
       x = "UNIT SALES",
       y = "DOLLAR SALES") +
  theme_minimal() +
  theme(legend.position = "bottom")  # Adjust legend position if needed
## `geom_smooth()` using formula = 'y ~ x'

Behold the realm of DIET MOONLIT. Certain items sell much better, or worse, when considering the slope of dollars to units sold. While most of its realm sits in the lower-left portion, other brands have sales extending along both the unit- and dollar-sales axes.

# Investigating drinks with casava as a flavor in the Item description.
# Create a new data frame with only the rows where the ITEM column contains the word 'casava'
casava_small <- df[grep("casava", df$ITEM, ignore.case = TRUE), ]
skim(casava_small)
Data summary
Name casava_small
Number of rows 42298
Number of columns 13
_______________________
Column type frequency:
character 9
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
MARKET_KEY 0 1 1 4 0 200 0
DATE 0 1 10 10 0 152 0
CATEGORY 0 1 3 18 0 4 0
MANUFACTURER 0 1 5 8 0 5 0
BRAND 0 1 5 26 0 27 0
PACKAGE 0 1 12 26 0 25 0
ITEM 0 1 46 112 0 83 0
REGION 0 1 5 11 0 11 0
SEASON 0 1 4 6 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
CALORIC_SEGMENT 0 1 0.60 0.49 0.00 0.00 1.00 1.00 1.00 ▅▁▁▁▇
UNIT_SALES 0 1 71.19 167.47 1.00 9.00 29.00 78.00 6678.00 ▇▁▁▁▁
DOLLAR_SALES 0 1 184.08 376.57 0.06 23.02 75.76 210.82 12569.79 ▇▁▁▁▁
MONTH 0 1 6.48 3.34 1.00 4.00 7.00 9.00 12.00 ▇▆▆▅▇

Casava has much lower mean unit sales and dollar sales, at 71 units and $184, than Diet Moonlit at 98 units and $344.

# casava small is dataframe
model <- lm(DOLLAR_SALES ~ UNIT_SALES + CALORIC_SEGMENT + PACKAGE + CATEGORY + SEASON + REGION, data = casava_small)
summary(model)
## 
## Call:
## lm(formula = DOLLAR_SALES ~ UNIT_SALES + CALORIC_SEGMENT + PACKAGE + 
##     CATEGORY + SEASON + REGION, data = casava_small)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3586.0   -33.7    -2.8    27.4  6441.3 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        5.231e+01  3.023e+01   1.730 0.083610 .  
## UNIT_SALES                         1.989e+00  4.773e-03 416.680  < 2e-16 ***
## CALORIC_SEGMENT                    3.840e+01  2.881e+01   1.333 0.182514    
## PACKAGE.5L 6ONE JUG               -5.519e+00  6.895e+00  -0.800 0.423522    
## PACKAGE.5L MULTI JUG              -7.090e+01  1.038e+02  -0.683 0.494725    
## PACKAGE12SMALL 12ONE CUP           2.229e+02  8.384e+00  26.590  < 2e-16 ***
## PACKAGE12SMALL 24ONE PLASTICS JUG  8.004e+01  1.462e+02   0.547 0.584147    
## PACKAGE12SMALL 6ONE CUP           -5.452e+01  3.446e+01  -1.582 0.113648    
## PACKAGE12SMALL 6ONE MEDIUM CUP    -3.456e+01  1.874e+01  -1.844 0.065143 .  
## PACKAGE12SMALL 6ONE SHADYES JUG   -1.113e+02  1.463e+02  -0.761 0.446865    
## PACKAGE12SMALL 8ONE BUMPY CUP     -4.476e-01  1.759e+01  -0.025 0.979704    
## PACKAGE12SMALL 8ONE CUP            1.599e+02  2.521e+01   6.342 2.29e-10 ***
## PACKAGE12SMALL MLT MEDIUM CUP     -4.479e+01  3.987e+01  -1.123 0.261293    
## PACKAGE12SMALL MLT PLASTICS JUG   -4.235e+01  9.134e+00  -4.637 3.55e-06 ***
## PACKAGE12SMALL MULTI CUP           3.735e+01  3.820e+01   0.978 0.328159    
## PACKAGE16SMALL 24ONE CUP          -1.149e+02  8.476e+01  -1.356 0.175140    
## PACKAGE16SMALL MLT SHADYES JUG    -4.473e+01  1.462e+02  -0.306 0.759722    
## PACKAGE16SMALL MULTI CUP          -9.059e+01  8.520e+00 -10.632  < 2e-16 ***
## PACKAGE18SMALL MULTI JUG          -4.557e+01  5.189e+00  -8.784  < 2e-16 ***
## PACKAGE1L MULTI JUG               -3.327e+01  1.259e+01  -2.643 0.008227 ** 
## PACKAGE20SMALL MULTI JUG          -2.398e+01  8.341e+00  -2.875 0.004040 ** 
## PACKAGE24 - 25SMALL MULTI JUG     -5.784e+01  6.641e+00  -8.710  < 2e-16 ***
## PACKAGE24SMALL MLT SHADYES JUG    -4.853e+01  1.777e+01  -2.731 0.006318 ** 
## PACKAGE2L MULTI JUG               -6.607e+01  8.359e+00  -7.904 2.76e-15 ***
## PACKAGE3L MULTI JUG               -1.002e+02  3.892e+01  -2.575 0.010039 *  
## PACKAGE7.5SMALL 6ONE CUP          -2.666e+01  3.225e+01  -0.827 0.408391    
## PACKAGEALL OTHER ONES              6.118e+01  2.519e+01   2.429 0.015140 *  
## CATEGORYING ENHANCED WATER        -2.870e+01  2.979e+01  -0.963 0.335324    
## CATEGORYSPARKLING WATER           -1.400e+02  1.660e+01  -8.435  < 2e-16 ***
## CATEGORYSSD                       -6.955e+01  3.379e+00 -20.580  < 2e-16 ***
## SEASONSPRING                       1.956e+00  2.032e+00   0.963 0.335692    
## SEASONSUMMER                       3.807e+00  2.000e+00   1.903 0.056996 .  
## SEASONWINTER                      -7.786e+00  2.089e+00  -3.727 0.000194 ***
## REGIONCALI_NEVADA                  4.253e+00  4.093e+00   1.039 0.298752    
## REGIONCOLORADO                     2.871e+01  2.591e+00  11.082  < 2e-16 ***
## REGIONDESERT_SW                    3.952e+00  2.998e+00   1.318 0.187461    
## REGIONKANSAS                       8.784e+01  5.536e+00  15.867  < 2e-16 ***
## REGIONMOUNTAIN                     2.101e+01  2.793e+00   7.522 5.52e-14 ***
## REGIONNEWMEXICO                    2.276e+00  3.579e+00   0.636 0.524778    
## REGIONNOCAL                        7.957e-01  3.733e+00   0.213 0.831219    
## REGIONNORTHERN                     2.658e+01  2.116e+00  12.563  < 2e-16 ***
## REGIONPRAIRIE                      8.415e+00  4.522e+00   1.861 0.062764 .  
## REGIONSOCAL                        3.299e+01  2.834e+00  11.641  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 146 on 42255 degrees of freedom
## Multiple R-squared:  0.8498, Adjusted R-squared:  0.8497 
## F-statistic:  5694 on 42 and 42255 DF,  p-value: < 2.2e-16

Our casava_small model has a lower R-squared of 0.85, but it also contains much more data: nearly 42K observations compared to our innovation df at about 5K. There are many significant features, but nothing that moves the needle in a huge way.

rm(list = ls())

#head(casava_small)

DIET MOONLIT has pretty decent sales at 69th place in total revenue. Casava is not the sexiest flavor in town, but with our innovation dataframe the R-squared is quite high (although it is based on regular-calorie items and no specific package type). Some weeks look great for a 6-month run; it’s just a matter of deciding which ones to use.
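Picking the launch window can be done programmatically rather than by eyeballing the bar chart. A sketch, assuming the `sales_by_casava` rolling-window frame from the earlier chunk is still in memory (note `rm(list = ls())` above clears it, so this would need to run beforehand):

```r
# Sketch: select the 21-week window with the highest rolling total sales.
library(dplyr)

best_window <- sales_by_casava %>%
  slice_max(sales_in_group, n = 1) %>%
  select(week_label, sales_in_group)
print(best_window)
```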

XGBOOST

# Load and prepare dataset
df <- read.csv("casava_one_hot.csv") 
# Load and prepare dataset 

#str(df)

df <- df %>% 
  #select(-DATE, -MONTH, -SEASON, -BRAND, -REGION, -ITEM )
  select(-DATE, -MONTH, -SEASON)
# Summarize the dataset
skimr::skim(df)
Data summary
Name df
Number of rows 7968
Number of columns 46
_______________________
Column type frequency:
numeric 46
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
CALORIC_SEGMENT 0 1 0.65 0.48 0.00 0.00 1.00 1.00 1.00 ▅▁▁▁▇
UNIT_SALES 0 1 82.74 194.92 1.00 18.00 46.00 97.00 4718.00 ▇▁▁▁▁
DOLLAR_SALES 0 1 165.06 374.63 0.25 33.23 83.97 184.78 10806.82 ▇▁▁▁▁
ENERGY_DRINK 0 1 0.64 0.48 0.00 0.00 1.00 1.00 1.00 ▅▁▁▁▇
NO_ARTIFICIAL_SWEETNERS 0 1 0.06 0.24 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
X16SMALL.24ONE.CUP 0 1 0.00 0.02 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
X16SMALL.MULTI.CUP 0 1 0.64 0.48 0.00 0.00 1.00 1.00 1.00 ▅▁▁▁▇
X12SMALL.MULTI.CUP 0 1 0.00 0.02 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
X2L.MULTI.JUG 0 1 0.36 0.48 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▅
SUNSET 0 1 0.57 0.50 0.00 0.00 1.00 1.00 1.00 ▆▁▁▁▇
BLAST 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
JUICED 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
GUAVA 0 1 0.21 0.41 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
RECOVERY 0 1 0.13 0.33 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
JACK 0 1 0.13 0.33 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
RESERVE 0 1 0.23 0.42 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
WHITE 0 1 0.23 0.42 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
PITAYA 0 1 0.06 0.24 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
ED 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
CASAVA 0 1 0.65 0.48 0.00 0.00 1.00 1.00 1.00 ▅▁▁▁▇
BRAND_DIET.MOONLIT 0 1 0.35 0.48 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▅
BRAND_HILL.MOISTURE.JUMPIN.FISH 0 1 0.00 0.02 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
BRAND_MOONLIT 0 1 0.01 0.10 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
BRAND_MYTHICAL.BEVERAGE 0 1 0.23 0.42 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
BRAND_SUPER.DUPER.JUICED 0 1 0.21 0.41 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
BRAND_SUPER.DUPER.PUNCHED 0 1 0.06 0.24 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
BRAND_SUPER.DUPER.RECOVERY 0 1 0.13 0.33 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
WEEK_OF_YEAR 0 1 26.89 15.14 1.00 14.00 27.00 40.00 53.00 ▇▇▇▇▇
REGION_ARIZONA 0 1 0.22 0.41 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
REGION_CALI_NEVADA 0 1 0.04 0.18 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
REGION_COLORADO 0 1 0.12 0.32 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
REGION_DESERT_SW 0 1 0.07 0.25 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
REGION_KANSAS 0 1 0.02 0.13 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
REGION_MOUNTAIN 0 1 0.10 0.30 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
REGION_NEWMEXICO 0 1 0.04 0.20 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
REGION_NOCAL 0 1 0.04 0.19 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
REGION_NORTHERN 0 1 0.26 0.44 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▃
REGION_PRAIRIE 0 1 0.02 0.16 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
REGION_SOCAL 0 1 0.08 0.27 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
ITEM_JUMPIN.FISH..CASAVA.JACK 0 1 0.00 0.02 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
ITEM_MOONLIT..CASAVA 0 1 0.01 0.10 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
ITEM_MOONLIT..SUNSET 0 1 0.35 0.48 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▅
ITEM_MYTHICAL.BEVERAGE.RESERVE..WHITE.CASAVA 0 1 0.23 0.42 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
ITEM_SUPER.DUPER.JUIC..CASAVA.SUNSET.GUAVA 0 1 0.21 0.41 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
ITEM_SUPER.DUPER.PITAYA…CASAVA. 0 1 0.06 0.24 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
ITEM_SUPER.DUPER.RECOVERY..CASAVA.JACK 0 1 0.13 0.33 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
#remove top one percent of unit sales to clean up outliers
df <- df %>% 
  filter(UNIT_SALES < quantile(UNIT_SALES, 0.99))
# Split the data
set.seed(123)
df_testtrn <- initial_split(df, prop = 0.8, strata = UNIT_SALES)
Train <- training(df_testtrn)
Test <- testing(df_testtrn)

# Prepare features and labels for XGBoost
train_features <- Train[, -which(names(Train) == "UNIT_SALES")]
train_labels <- Train$UNIT_SALES
test_features <- Test[, -which(names(Test) == "UNIT_SALES")]
test_labels <- Test$UNIT_SALES

# Convert data to DMatrix format
dtrain <- xgb.DMatrix(data = as.matrix(train_features), label = train_labels)
dtest <- xgb.DMatrix(data = as.matrix(test_features), label = test_labels)
# Define XGBoost parameters
set.seed(123)
params <- list(
  booster = "gbtree",
  objective = "reg:squarederror",
  eval_metric = "rmse",
  eta = 0.05,
  max_depth = 4,
  min_child_weight = 3,
  subsample = 0.7,
  colsample_bytree = 0.6,
  lambda = 1,
  alpha = 1
)
# Perform cross-validation to find the optimal number of boosting rounds
cv_results <- xgb.cv(
  params = params,
  data = dtrain,  
  nfold = 5,
  nrounds = 500,
  early_stopping_rounds = 10,
  metrics = "rmse",
  seed = 123
)
## [1]  train-rmse:94.423501+0.564879   test-rmse:94.429789+2.236629 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [2]  train-rmse:89.908282+0.530045   test-rmse:89.939602+2.123119 
## [3]  train-rmse:86.177544+0.716596   test-rmse:86.218138+2.130073 
## [4]  train-rmse:82.613269+0.737682   test-rmse:82.644481+1.576368 
## [5]  train-rmse:79.238916+0.526982   test-rmse:79.269607+1.603306 
## [6]  train-rmse:76.271102+0.915397   test-rmse:76.314502+0.952345 
## [7]  train-rmse:73.478895+0.952286   test-rmse:73.539239+1.368307 
## [8]  train-rmse:70.289821+0.944773   test-rmse:70.355578+1.755051 
## [9]  train-rmse:67.536899+1.243970   test-rmse:67.607092+2.042894 
## [10] train-rmse:64.652668+1.422891   test-rmse:64.721507+1.871742 
## ... (per-round output for rounds 11-386 elided; test-rmse declines steadily from ~62.4 to ~11.56) ...
## [387]    train-rmse:9.466473+0.109195    test-rmse:11.562778+0.305902 
## Stopping. Best iteration:
## [377]    train-rmse:9.498560+0.111276    test-rmse:11.560506+0.304451
best_nrounds <- cv_results$best_iteration
# Train the final model using the best number of rounds found
model_xgb <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = best_nrounds
)
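The `best_iteration` used above is simply the boosting round with the lowest mean test RMSE across the cross-validation folds. In miniature, with a toy error curve (made-up numbers, not the actual cv results): error falls, bottoms out, then creeps back up, and the minimum marks the round to keep.

```r
# Toy test-RMSE curve mimicking the cv log: falls, bottoms out, rises again
test_rmse <- c(94.4, 52.0, 20.6, 12.1, 11.56, 11.57, 11.60)
best_round <- which.min(test_rmse)
best_round             # 5 -- analogous to cv_results$best_iteration
test_rmse[best_round]  # 11.56
```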
# Make predictions and evaluate the model
train_pred <- predict(model_xgb, dtrain)
test_pred <- predict(model_xgb, dtest)
train_rmse <- sqrt(mean((train_labels - train_pred)^2))
test_rmse <- sqrt(mean((test_labels - test_pred)^2))
# Calculate R-squared for the training set
sst_train <- sum((train_labels - mean(train_labels)) ^ 2)
ssr_train <- sum((train_labels - train_pred) ^ 2)
r_squared_train <- 1 - (ssr_train / sst_train)

# Calculate R-squared for the test set
sst_test <- sum((test_labels - mean(test_labels)) ^ 2)
ssr_test <- sum((test_labels - test_pred) ^ 2)
r_squared_test <- 1 - (ssr_test / sst_test)
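The R-squared computation above follows the standard identity R² = 1 − SSR/SST. A small worked check with made-up values confirms the arithmetic:

```r
# Tiny worked example of R^2 = 1 - SSR/SST (toy values, not model output)
actual    <- c(3, 5, 7, 9)
predicted <- c(2.8, 5.2, 7.1, 8.9)
sst <- sum((actual - mean(actual))^2)  # 20
ssr <- sum((actual - predicted)^2)     # 0.1
r2  <- 1 - ssr / sst                   # 0.995
```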

train_mape <- mean(abs((train_labels - train_pred) / train_labels)) * 100
test_mape <- mean(abs((test_labels - test_pred) / test_labels)) * 100
train_mae <- mean(abs(train_labels - train_pred))
test_mae <- mean(abs(test_labels - test_pred))
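One caution on the MAPE formula above: it divides by the actual values, so any zero-sales observation would make it infinite, and very small actuals inflate it disproportionately. (The outlier filtering and data prep upstream appear to avoid zeros here, but it is worth verifying.) A minimal illustration:

```r
# MAPE as computed above, on toy values
actual <- c(100, 50); pred <- c(90, 55)
mean(abs((actual - pred) / actual)) * 100   # 10% -- well-behaved

# With a zero actual, the same formula blows up
actual0 <- c(100, 0); pred0 <- c(90, 5)
mean(abs((actual0 - pred0) / actual0)) * 100  # Inf
```

MAE, reported alongside it, does not have this failure mode, which is one reason to read the two together.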
cat("Model Performance Metrics:\n",
    "--------------------------\n",
    "Training RMSE: ", train_rmse, "\n",
    "Test RMSE: ", test_rmse, "\n",
    "Training R-squared: ", r_squared_train, "\n",
    "Test R-squared: ", r_squared_test, "\n",
    "Training MAE: ", train_mae, "\n",
    "Test MAE: ", test_mae, "\n",
    "Training MAPE: ", train_mape, "%\n",
    "Test MAPE: ", test_mape, "%\n", sep="")
## Model Performance Metrics:
## --------------------------
## Training RMSE: 9.726129
## Test RMSE: 11.42459
## Training R-squared: 0.9812686
## Test R-squared: 0.9750236
## Training MAE: 6.503416
## Test MAE: 7.239086
## Training MAPE: 20.10542%
## Test MAPE: 23.55032%
# Build a residuals data frame from the actual values and model predictions

residuals_train <- train_labels - train_pred
residuals_test <- test_labels - test_pred

residuals_data <- data.frame(
  Residuals = c(residuals_train, residuals_test),
  Dataset = c(rep('Training', length(residuals_train)), rep('Test', length(residuals_test)))
)

# Plot the distribution of residuals for each dataset
ggplot(residuals_data, aes(x = Residuals, fill = Dataset)) +
  geom_histogram(binwidth = 1, position = 'identity', alpha = 0.6) +
  facet_wrap(~ Dataset) +
  ggtitle('Residuals Distribution')

# Combine actual and predicted values from both sets for plotting
actual_pred_data <- data.frame(
  Actual = c(train_labels, test_labels),
  Predicted = c(train_pred, test_pred),
  Dataset = c(rep('Training', length(train_labels)), rep('Test', length(test_labels)))
)

# Plotting Actual vs. Predicted Values
ggplot(actual_pred_data, aes(x = Actual, y = Predicted, colour = Dataset)) +
  geom_point(alpha = 0.6) +
  geom_abline(intercept = 0, slope = 1, linetype = 'dashed', color = 'red') +
  xlab('Actual Values') +
  ylab('Predicted Values') +
  scale_colour_manual(values = c('Training' = 'blue', 'Test' = 'red')) +
  ggtitle('Actual vs. Predicted Values')

# Calculate feature importance
importance_matrix <- xgb.importance(feature_names = colnames(train_features), model = model_xgb)

# View the feature importance scores
print(importance_matrix)
##                                          Feature         Gain        Cover
##  1:                                 DOLLAR_SALES 7.492052e-01 0.4515138155
##  2:                                      RESERVE 9.069002e-02 0.0647068095
##  3:                                        WHITE 5.722103e-02 0.0265604428
##  4:                      BRAND_MYTHICAL.BEVERAGE 4.360819e-02 0.0128073188
##  5:                                 WEEK_OF_YEAR 8.122528e-03 0.1212281935
##  6:                               REGION_PRAIRIE 7.327083e-03 0.0178425217
##  7:                              REGION_COLORADO 7.013103e-03 0.0145408679
##  8:                              REGION_MOUNTAIN 4.925610e-03 0.0223640637
##  9:                              REGION_NORTHERN 4.844435e-03 0.0276914310
## 10:                                       SUNSET 4.108095e-03 0.0136490055
## 11:                              CALORIC_SEGMENT 3.878993e-03 0.0273677749
## 12:                                 REGION_NOCAL 2.564525e-03 0.0142378547
## 13:                               REGION_ARIZONA 2.141366e-03 0.0186326765
## 14: ITEM_MYTHICAL.BEVERAGE.RESERVE..WHITE.CASAVA 1.976363e-03 0.0024233524
## 15:                                       CASAVA 1.922049e-03 0.0090394677
## 16:                           BRAND_DIET.MOONLIT 1.842857e-03 0.0048839223
## 17:                      NO_ARTIFICIAL_SWEETNERS 1.445095e-03 0.0160074158
## 18:                                 REGION_SOCAL 1.022467e-03 0.0083460480
## 19:                                        GUAVA 9.304661e-04 0.0117596555
## 20:                                REGION_KANSAS 8.220668e-04 0.0132108342
## 21:                             REGION_NEWMEXICO 7.997970e-04 0.0224282525
## 22:                             REGION_DESERT_SW 6.937068e-04 0.0150799934
## 23:                     BRAND_SUPER.DUPER.JUICED 6.924043e-04 0.0056749812
## 24:                                     RECOVERY 4.225686e-04 0.0027870888
## 25:                                 ENERGY_DRINK 3.839939e-04 0.0022276218
## 26:                           REGION_CALI_NEVADA 2.648522e-04 0.0135579961
## 27:                           X16SMALL.MULTI.CUP 2.439811e-04 0.0060867838
## 28:            ITEM_SUPER.DUPER.PITAYA...CASAVA. 2.349725e-04 0.0005249618
## 29:                                       PITAYA 2.272659e-04 0.0081366058
## 30:   ITEM_SUPER.DUPER.JUIC..CASAVA.SUNSET.GUAVA 1.105918e-04 0.0027785002
## 31:                                         JACK 8.119516e-05 0.0036155158
## 32:                    BRAND_SUPER.DUPER.PUNCHED 7.506285e-05 0.0052417823
## 33:                                BRAND_MOONLIT 6.332375e-05 0.0071088319
## 34:                                X2L.MULTI.JUG 3.444163e-05 0.0010898531
## 35:                   BRAND_SUPER.DUPER.RECOVERY 3.389506e-05 0.0011487682
## 36:                         ITEM_MOONLIT..CASAVA 1.471366e-05 0.0028583595
## 37:                         ITEM_MOONLIT..SUNSET 1.170011e-05 0.0008406319
##                                          Feature         Gain        Cover
##        Frequency
##  1: 0.3458882611
##  2: 0.0539861896
##  3: 0.0253191044
##  4: 0.0108809374
##  5: 0.1895794099
##  6: 0.0238543628
##  7: 0.0282485876
##  8: 0.0301318267
##  9: 0.0307595731
## 10: 0.0182046453
## 11: 0.0278300900
## 12: 0.0171584013
## 13: 0.0288763340
## 14: 0.0018832392
## 15: 0.0058589663
## 16: 0.0041849759
## 17: 0.0152751622
## 18: 0.0131826742
## 19: 0.0142289182
## 20: 0.0133919230
## 21: 0.0161121574
## 22: 0.0125549278
## 23: 0.0069052103
## 24: 0.0094161959
## 25: 0.0096254447
## 26: 0.0106716886
## 27: 0.0050219711
## 28: 0.0008369952
## 29: 0.0050219711
## 30: 0.0039757271
## 31: 0.0056497175
## 32: 0.0037664783
## 33: 0.0054404687
## 34: 0.0010462440
## 35: 0.0023017368
## 36: 0.0023017368
## 37: 0.0006277464
##        Frequency
# Plot the feature importance
xgb.plot.importance(importance_matrix = importance_matrix)

# Compute partial dependence data for 'DOLLAR_SALES', 'CASAVA', 'CALORIC_SEGMENT', and 'ENERGY_DRINK'
# pd <- partial(model_xgb, pred.var = c("DOLLAR_SALES", "CASAVA", "CALORIC_SEGMENT", "ENERGY_DRINK"), train = train_features, grid.resolution = 20)
# 
# # Default PDP
# pdp1 <- plotPartial(pd, plot = TRUE)
# 
# # Add contour lines and use a different color palette
# rwb <- colorRampPalette(c("red", "white", "blue"))
# pdp2 <- plotPartial(pd, contour = TRUE, col.regions = rwb)
# 
# # 3-D surface
# pdp3 <- plotPartial(pd, levelplot = FALSE, zlab = "Predicted Outcome", drape = TRUE, colorkey = TRUE, screen = list(z = -20, x = -60))
# 
# # Combine plots into one window
# grid.arrange(pdp1, pdp2, pdp3, ncol = 3)

XGBOOST Model #2

Model Without the DOLLAR_SALES Variable

# Assuming 'df' is your complete dataframe and 'UNIT_SALES' is your target variable
df2 <- df
# Remove DOLLAR_SALES from the features
df2$DOLLAR_SALES <- NULL

# Split the updated data into training and testing sets (assuming you're using a similar approach as before)
set.seed(123)
df2_testtrn <- initial_split(df2, prop = 0.8, strata = UNIT_SALES)
Train <- training(df2_testtrn)
Test <- testing(df2_testtrn)

# Prepare features and labels for XGBoost, excluding DOLLAR_SALES
train_features <- Train[, -which(names(Train) == "UNIT_SALES")]
train_labels <- Train$UNIT_SALES
test_features <- Test[, -which(names(Test) == "UNIT_SALES")]
test_labels <- Test$UNIT_SALES

# Convert data to DMatrix format for XGBoost
dtrain <- xgb.DMatrix(data = as.matrix(train_features), label = train_labels)
dtest <- xgb.DMatrix(data = as.matrix(test_features), label = test_labels)
# Assuming 'params' and 'best_nrounds' are defined as before

# Train the final model without DOLLAR_SALES
model_xgb_no_dollar_sales <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = best_nrounds
)
# Make predictions and evaluate the model
train_pred <- predict(model_xgb_no_dollar_sales, dtrain)
test_pred <- predict(model_xgb_no_dollar_sales, dtest)
train_rmse <- sqrt(mean((train_labels - train_pred)^2))
test_rmse <- sqrt(mean((test_labels - test_pred)^2))
# Calculate R-squared for the training set
sst_train <- sum((train_labels - mean(train_labels)) ^ 2)
ssr_train <- sum((train_labels - train_pred) ^ 2)
r_squared_train <- 1 - (ssr_train / sst_train)

# Calculate R-squared for the test set
sst_test <- sum((test_labels - mean(test_labels)) ^ 2)
ssr_test <- sum((test_labels - test_pred) ^ 2)
r_squared_test <- 1 - (ssr_test / sst_test)

# Note: MAPE divides by the actual values, so rows with small UNIT_SALES
# inflate the percentages reported below
train_mape <- mean(abs((train_labels - train_pred) / train_labels)) * 100
test_mape <- mean(abs((test_labels - test_pred) / test_labels)) * 100
train_mae <- mean(abs(train_labels - train_pred))
test_mae <- mean(abs(test_labels - test_pred))
cat("Model Performance Metrics:\n",
    "--------------------------\n",
    "Training RMSE: ", train_rmse, "\n",
    "Test RMSE: ", test_rmse, "\n",
    "Training R-squared: ", r_squared_train, "\n",
    "Test R-squared: ", r_squared_test, "\n",
    "Training MAE: ", train_mae, "\n",
    "Test MAE: ", test_mae, "\n",
    "Training MAPE: ", train_mape, "%\n",
    "Test MAPE: ", test_mape, "%\n", sep="")
## Model Performance Metrics:
## --------------------------
## Training RMSE: 51.03556
## Test RMSE: 57.0461
## Training R-squared: 0.4842549
## Test R-squared: 0.3772679
## Training MAE: 34.92517
## Test MAE: 37.98006
## Training MAPE: 228.2699%
## Test MAPE: 274.0136%
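To make the drop-off explicit, the two models' test metrics can be set side by side. This is a small sketch, not part of the original pipeline; the values are transcribed from the printed outputs above.

```r
# Side-by-side comparison of the two XGBoost models on the test set
# (values copied from the metrics printed above)
comparison <- data.frame(
  Metric               = c("RMSE", "R-squared", "MAE", "MAPE (%)"),
  With_DOLLAR_SALES    = c(11.42459, 0.9750236, 7.239086, 23.55032),
  Without_DOLLAR_SALES = c(57.0461, 0.3772679, 37.98006, 274.0136)
)
print(comparison)
```

Removing DOLLAR_SALES roughly quintuples test RMSE, which is unsurprising given that DOLLAR_SALES carried about 75% of the gain in the first model's importance table.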
# Calculate feature importance
importance_matrix2 <- xgb.importance(feature_names = colnames(train_features), model = model_xgb_no_dollar_sales)

# View the feature importance scores
print(importance_matrix2)
##                                          Feature         Gain        Cover
##  1:                                      RESERVE 0.3454217961 0.0694608014
##  2:                                 WEEK_OF_YEAR 0.1207836505 0.2990349831
##  3:                                        WHITE 0.0926169672 0.0233899482
##  4:                              REGION_COLORADO 0.0534639570 0.0365789726
##  5:                      BRAND_MYTHICAL.BEVERAGE 0.0422554479 0.0173343757
##  6:                      NO_ARTIFICIAL_SWEETNERS 0.0385900817 0.0202321984
##  7:                               REGION_PRAIRIE 0.0342124484 0.0430368565
##  8:                              REGION_MOUNTAIN 0.0340366041 0.0564988345
##  9:                              REGION_NORTHERN 0.0300446682 0.0421604359
## 10:                                       SUNSET 0.0250295152 0.0164420504
## 11:                                     RECOVERY 0.0214300745 0.0150953073
## 12:                              CALORIC_SEGMENT 0.0198704988 0.0232215107
## 13:                                 REGION_NOCAL 0.0139207152 0.0264419702
## 14:                               REGION_ARIZONA 0.0127081104 0.0267797541
## 15:                                REGION_KANSAS 0.0123352008 0.0464927044
## 16: ITEM_MYTHICAL.BEVERAGE.RESERVE..WHITE.CASAVA 0.0122444139 0.0021805999
## 17:                           X16SMALL.MULTI.CUP 0.0113472983 0.0121597679
## 18:                                 REGION_SOCAL 0.0093423100 0.0239332806
## 19:                   BRAND_SUPER.DUPER.RECOVERY 0.0081789856 0.0029252393
## 20:                                 ENERGY_DRINK 0.0075478920 0.0088034375
## 21:                                       CASAVA 0.0072047868 0.0082625287
## 22:                             REGION_DESERT_SW 0.0067948773 0.0371759262
## 23:                                        GUAVA 0.0067516507 0.0232410507
## 24:                             REGION_NEWMEXICO 0.0063191294 0.0323737896
## 25:                           REGION_CALI_NEVADA 0.0055061894 0.0232621054
## 26:                                       PITAYA 0.0042657907 0.0096262368
## 27:                           BRAND_DIET.MOONLIT 0.0042185355 0.0033749615
## 28:                     BRAND_SUPER.DUPER.JUICED 0.0035660876 0.0127109768
## 29:                         ITEM_MOONLIT..SUNSET 0.0025073756 0.0026621314
## 30:                                         JACK 0.0022992594 0.0084698947
## 31:            ITEM_SUPER.DUPER.PITAYA...CASAVA. 0.0010174978 0.0008600616
## 32:                    BRAND_SUPER.DUPER.PUNCHED 0.0009445124 0.0029850710
## 33:   ITEM_SUPER.DUPER.JUIC..CASAVA.SUNSET.GUAVA 0.0009097770 0.0054184787
## 34:                                BRAND_MOONLIT 0.0008045329 0.0087837460
## 35:                                X2L.MULTI.JUG 0.0007361638 0.0019555116
## 36:                         ITEM_MOONLIT..CASAVA 0.0006138386 0.0062449134
## 37:       ITEM_SUPER.DUPER.RECOVERY..CASAVA.JACK 0.0001593594 0.0003895876
##                                          Feature         Gain        Cover
##       Frequency
##  1: 0.055235658
##  2: 0.308381318
##  3: 0.020473448
##  4: 0.036468330
##  5: 0.015568351
##  6: 0.026444871
##  7: 0.030070377
##  8: 0.048197910
##  9: 0.046918319
## 10: 0.021113244
## 11: 0.030070377
## 12: 0.031349968
## 13: 0.027084666
## 14: 0.031776498
## 15: 0.030496908
## 16: 0.001706121
## 17: 0.012156110
## 18: 0.029857112
## 19: 0.004905097
## 20: 0.018767328
## 21: 0.009810194
## 22: 0.026658136
## 23: 0.025591811
## 24: 0.026658136
## 25: 0.017274472
## 26: 0.014075496
## 27: 0.002985711
## 28: 0.013222435
## 29: 0.002132651
## 30: 0.011942845
## 31: 0.002559181
## 32: 0.003198976
## 33: 0.004691832
## 34: 0.004265302
## 35: 0.003838772
## 36: 0.002559181
## 37: 0.001492856
##       Frequency
xgb.plot.importance(importance_matrix = importance_matrix2)

if (!requireNamespace("pdp", quietly = TRUE)) install.packages("pdp")
if (!requireNamespace("xgboost", quietly = TRUE)) install.packages("xgboost")
library(pdp)
library(xgboost)
pdp::partial(model_xgb_no_dollar_sales, pred.var = "WEEK_OF_YEAR", train = train_features)
##    WEEK_OF_YEAR     yhat
## 1          1.00 76.95211
## 2          2.04 71.71872
## 3          3.08 69.83707
## 4          4.12 66.32345
## 5          5.16 60.81318
## 6          6.20 70.90967
## 7          7.24 68.43738
## 8          8.28 63.29930
## 9          9.32 65.10666
## 10        10.36 67.16075
## 11        11.40 65.23766
## 12        12.44 60.54461
## 13        13.48 62.33341
## 14        14.52 62.61749
## 15        15.56 63.04057
## 16        16.60 62.84779
## 17        17.64 65.46582
## 18        18.68 71.05545
## 19        19.72 74.71198
## 20        20.76 76.75920
## 21        21.80 76.67443
## 22        22.84 74.15679
## 23        23.88 72.66046
## 24        24.92 76.08584
## 25        25.96 76.45128
## 26        27.00 72.19917
## 27        28.04 65.85761
## 28        29.08 70.96944
## 29        30.12 64.56317
## 30        31.16 67.39512
## 31        32.20 78.84320
## 32        33.24 79.74898
## 33        34.28 79.04015
## 34        35.32 78.92823
## 35        36.36 77.01705
## 36        37.40 76.63151
## 37        38.44 74.05521
## 38        39.48 68.70417
## 39        40.52 54.57311
## 40        41.56 58.36970
## 41        42.60 57.34105
## 42        43.64 61.53831
## 43        44.68 65.07709
## 44        45.72 66.35301
## 45        46.76 70.20591
## 46        47.80 65.56802
## 47        48.84 64.82494
## 48        49.88 70.71167
## 49        50.92 67.44957
## 50        51.96 77.38542
## 51        53.00 62.14893
pd <- partial(model_xgb_no_dollar_sales, pred.var = "WEEK_OF_YEAR", train = train_features, grid.resolution = 20)

# Default PDP
pdp1 <- plotPartial(pd, plot = TRUE)

# plot
grid.arrange(pdp1)

Based on the Casava Energy Drink 2L MULTI JUG innovation dataframe, we expect the product's best-selling months to fall roughly between weeks 17 and 38.
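The week range above was read off the partial dependence plot. A rough programmatic check is to slide a 13-week window over the partial-dependence curve and pick the window with the highest average prediction. This sketch reuses `model_xgb_no_dollar_sales` and `train_features` from the chunks above.

```r
library(pdp)
library(zoo)
# Recompute the weekly partial dependence on a (roughly) one-week grid
pd_week <- partial(model_xgb_no_dollar_sales, pred.var = "WEEK_OF_YEAR",
                   train = train_features, grid.resolution = 52)
# Average predicted demand over each left-aligned 13-week window
roll13 <- rollmean(pd_week$yhat, k = 13, align = "left", fill = NA)
best_start <- pd_week$WEEK_OF_YEAR[which.max(roll13)]
cat("Strongest 13-week window starts near week", round(best_start), "\n")
```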

rm(list = ls())

Innovation Model 3 - Kiwano Venomous Blast

Linear Regression Review

#skim(df) #same initial dataset used in previous models
# create a table of total values by brand
brand_summary <- df %>%
  group_by(BRAND) %>%
  summarise(
    total_units_sold = sum(UNIT_SALES),
    total_revenue = sum(DOLLAR_SALES),
    avg_price = total_revenue / total_units_sold,
    total_days_sold = n() # Count the number of rows for each brand
  ) %>%
  arrange(desc(total_revenue)) %>%  # Order by revenue in descending order
  mutate(rank = row_number()) 

summary(brand_summary)
##     BRAND           total_units_sold   total_revenue         avg_price      
##  Length:288         Min.   :       1   Min.   :        1   Min.   : 0.5315  
##  Class :character   1st Qu.:    2310   1st Qu.:     7563   1st Qu.: 2.0861  
##  Mode  :character   Median :   94691   Median :   266075   Median : 3.0291  
##                     Mean   : 1473003   Mean   :  4989427   Mean   : 3.2661  
##                     3rd Qu.:  651385   3rd Qu.:  2161764   3rd Qu.: 3.7252  
##                     Max.   :40414038   Max.   :159387186   Max.   :42.9378  
##  total_days_sold         rank       
##  Min.   :     1.0   Min.   :  1.00  
##  1st Qu.:   121.8   1st Qu.: 72.75  
##  Median :  1988.0   Median :144.50  
##  Mean   :  8493.5   Mean   :144.50  
##  3rd Qu.:  8075.8   3rd Qu.:216.25  
##  Max.   :124603.0   Max.   :288.00
print(brand_summary[brand_summary$BRAND == "VENOMOUS BLAST", ])
## # A tibble: 1 × 6
##   BRAND          total_units_sold total_revenue avg_price total_days_sold  rank
##   <chr>                     <dbl>         <dbl>     <dbl>           <int> <int>
## 1 VENOMOUS BLAST           360173       361370.      1.00            5188   130

VENOMOUS BLAST has a decent amount of sales, ranking 130th of 288 brands in total revenue. Surprisingly, it has both a low average price (about $1.00) and a low total days sold.

# Filter the dataframe for only 'Venomous Blast'
filtered_df <- df %>% 
  filter(BRAND == "VENOMOUS BLAST")

summary(filtered_df)
##   MARKET_KEY            DATE           CALORIC_SEGMENT    CATEGORY        
##  Length:5188        Length:5188        Min.   :0.0000   Length:5188       
##  Class :character   Class :character   1st Qu.:0.0000   Class :character  
##  Mode  :character   Mode  :character   Median :1.0000   Mode  :character  
##                                        Mean   :0.7406                     
##                                        3rd Qu.:1.0000                     
##                                        Max.   :1.0000                     
##    UNIT_SALES       DOLLAR_SALES     MANUFACTURER          BRAND          
##  Min.   :   1.00   Min.   :   0.50   Length:5188        Length:5188       
##  1st Qu.:   6.00   1st Qu.:   5.92   Class :character   Class :character  
##  Median :  16.00   Median :  16.64   Mode  :character   Mode  :character  
##  Mean   :  69.42   Mean   :  69.66                                        
##  3rd Qu.:  41.00   3rd Qu.:  42.20                                        
##  Max.   :3298.00   Max.   :3199.67                                        
##    PACKAGE              ITEM              REGION              MONTH       
##  Length:5188        Length:5188        Length:5188        Min.   : 1.000  
##  Class :character   Class :character   Class :character   1st Qu.: 3.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 6.174  
##                                                           3rd Qu.: 9.000  
##                                                           Max.   :12.000  
##     SEASON         
##  Length:5188       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
# Create the plot
ggplot(filtered_df, aes(x = UNIT_SALES, y = DOLLAR_SALES)) +
  geom_point(color = "red", alpha = 1) +  # Bright red points with full opacity
  geom_smooth(method = "lm", color = "black", se = FALSE) +  # Add linear regression line without confidence band
  labs(title = "Linear Model of UNIT_SALES vs. DOLLAR_SALES for VENOMOUS BLAST",
       x = "UNIT SALES",
       y = "DOLLAR SALES") +
  theme_minimal() +
  theme(legend.position = "none")  
## `geom_smooth()` using formula = 'y ~ x'

filtered_df %>%
  mutate(DATE = as.Date(DATE)) %>%
  mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
  group_by(WEEK) %>%
  summarise(total_sales = sum(UNIT_SALES)) %>%
  ggplot(aes(x = WEEK, y = total_sales)) +
  geom_line(color = "black") +  # Black line connecting weekly totals
  labs(title = "Total Sales by Week of the Year",
       x = "Week of the Year",
       y = "Total Unit Sales") +
  theme_minimal()

> This shows that weekly sales of VENOMOUS BLAST are spread fairly evenly across the year

#find the best 13 weeks
library(zoo)
# Calculate total sales for each group of 13 consecutive weeks
sales_by_group <- filtered_df %>%
  mutate(DATE = as.Date(DATE)) %>%
  mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
  group_by(WEEK) %>%
  summarise(total_sales = sum(UNIT_SALES)) %>%
  mutate(sales_in_group = rollsum(total_sales, 13, align = "left", fill = NA)) %>%
  mutate(week_label = paste0("Week ", WEEK + 1, " to Week ", WEEK + 13)) %>%
  arrange(WEEK) %>%  # Order by WEEK
  filter(!is.na(sales_in_group))  # Remove rows with sales_in_group = NA

# Plot the bar chart
sales_by_group$week_label <- factor(sales_by_group$week_label, levels = sales_by_group$week_label[order(sales_by_group$WEEK)])
ggplot(sales_by_group, aes(x = factor(week_label), y = sales_in_group)) +
  geom_bar(stat = "identity", fill = "black") +
  labs(title = "Total Sales for Each 13-Week Grouping",
       x = "Weeks (Starting from Week 1)",
       y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

> From this graph we see that weeks 24 to 36 historically have the highest unit sales of VENOMOUS BLAST.
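Rather than reading the peak off the bar chart, the top window can be pulled directly from the rolling sums (assumes `sales_by_group` from the chunk above is still in memory):

```r
library(dplyr)
# Highest-grossing 13-week window for VENOMOUS BLAST
sales_by_group %>%
  slice_max(sales_in_group, n = 1) %>%
  select(week_label, sales_in_group)
```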

#find the best 13 weeks for Kiwano sales
# Calculate total sales for each group of 13 consecutive weeks
sales_by_kiwano <- df %>%
  filter(str_detect(ITEM, "KIWANO")) %>%
  mutate(DATE = as.Date(DATE)) %>%
  mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
  group_by(WEEK) %>%
  summarise(total_sales = sum(UNIT_SALES)) %>%
  mutate(sales_in_group = rollsum(total_sales, 13, align = "left", fill = NA)) %>%
  mutate(week_label = paste0("Week ", WEEK + 1, " to Week ", WEEK + 13)) %>%
  arrange(WEEK) %>%  # Order by WEEK
  filter(!is.na(sales_in_group))  # Remove rows with sales_in_group = NA

# Plot the bar chart
sales_by_kiwano$week_label <- factor(sales_by_kiwano$week_label, levels = sales_by_kiwano$week_label[order(sales_by_kiwano$WEEK)])
ggplot(sales_by_kiwano, aes(x = factor(week_label), y = sales_in_group)) +
  geom_bar(stat = "identity", fill = "black") +
  labs(title = "Total Sales for Each 13-Week Grouping",
       x = "Weeks (Starting from Week 1)",
       y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

> This graph shows that the best 13-week window for sales of any kiwano drink is weeks 19 to 31.

#find the best 13 weeks for Kiwano sales
# Calculate total sales for each group of 13 consecutive weeks
sales_by_energy <- df %>%
  filter(CATEGORY == "ENERGY",
         str_detect(ITEM, "KIWANO"),
         str_detect(PACKAGE, "16")) %>%
  mutate(DATE = as.Date(DATE)) %>%
  mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
  group_by(WEEK) %>%
  summarise(total_sales = sum(UNIT_SALES)) %>%
  mutate(sales_in_group = rollsum(total_sales, 13, align = "left", fill = NA)) %>%
  mutate(week_label = paste0("Week ", WEEK + 1, " to Week ", WEEK + 13)) %>%
  arrange(WEEK) %>%  # Order by WEEK
  filter(!is.na(sales_in_group))  # Remove rows with sales_in_group = NA

# Plot the bar chart
sales_by_energy$week_label <- factor(sales_by_energy$week_label, levels = sales_by_energy$week_label[order(sales_by_energy$WEEK)])
ggplot(sales_by_energy, aes(x = factor(week_label), y = sales_in_group)) +
  geom_bar(stat = "identity", fill = "black") +
  labs(title = "Total Sales for Each 13-Week Grouping",
       x = "Weeks (Starting from Week 1)",
       y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

> This graph shows that the best 13-week window for sales of kiwano-flavored energy drinks in 16 oz packaging is weeks 12 to 24.

innovation <- df %>% 
  filter(CATEGORY == "ENERGY",
         CALORIC_SEGMENT == 0,
         str_detect(ITEM, "KIWANO"),
         str_detect(PACKAGE, "16"))

print(unique(innovation$ITEM))
##  [1] "MYTHICAL BEVERAGE ULTRA KIWANO  ENERGY DRINK UNFLAVORED ZERO SUGAR CUP 16 LIQUID SMALL X4"           
##  [2] "SUPER-DUPER PURE ZERO ENERGY DRINK KIWANO  KEKE  SUGAR FREE CUP 16 LIQUID SMALL"                     
##  [3] "RAINING JUMPIN-FISH GAME FUEL ZERO ENERGY DRINK CHARGED KIWANO  SHOCK ZERO SUGAR CUP 16 LIQUID SMALL"
##  [4] "MYTHICAL BEVERAGE ULTRA KIWANO  ENERGY DRINK UNFLAVORED ZERO SUGAR CUP 16 LIQUID SMALL"              
##  [5] "SUPER-DUPER PURE ZERO ENERGY DRINK KIWANO  ZERO SUGAR CUP 16 LIQUID SMALL"                           
##  [6] "MYTHICAL BEVERAGE ULTRA KIWANO  ENERGY DRINK UNFLAVORED ZERO SUGAR CUP 16 LIQUID SMALL X24"          
##  [7] "VENOMOUS BLAST ENERGY DRINK KIWANO  DURIAN  CUP 16 LIQUID SMALL"                                     
##  [8] "POW-POW GENTLE DRINK WYLDIN KIWANO  CUP 16 LIQUID SMALL X12"                                         
##  [9] "MYTHICAL BEVERAGE REHAB ENERGY DRINK KIWANO  CUP 15.5 LIQUID SMALL X24"                              
## [10] "SUPER-DUPER PURE ZERO ENERGY DRINK KIWANO  ZERO SUGAR CUP 16 LIQUID SMALL X24"
# there are 10 zero-calorie kiwano energy items in 16 oz packaging; only one of them (the KIWANO DURIAN cup) is from VENOMOUS BLAST



library(dplyr)
library(lubridate)

innovation <- innovation %>%
  mutate(
    MONTH = month(ymd(DATE)),  # Extract month using lubridate's ymd function
    MONTH = as.factor(MONTH)   # Convert the extracted month into a factor
  )

str(innovation)
## 'data.frame':    8082 obs. of  13 variables:
##  $ MARKET_KEY     : chr  "504" "953" "133" "817" ...
##  $ DATE           : chr  "2022-02-26" "2022-08-20" "2020-12-19" "2022-02-05" ...
##  $ CALORIC_SEGMENT: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CATEGORY       : chr  "ENERGY" "ENERGY" "ENERGY" "ENERGY" ...
##  $ UNIT_SALES     : num  11 13 20 194 8 176 87 300 4 102 ...
##  $ DOLLAR_SALES   : num  78.9 21.8 40.5 287.1 63.4 ...
##  $ MANUFACTURER   : chr  "PONYS" "JOLLYS" "JOLLYS" "JOLLYS" ...
##  $ BRAND          : chr  "MYTHICAL BEVERAGE ULTRA" "SUPER-DUPER PURE ZERO" "HILL MOISTURE JUMPIN-FISH" "SUPER-DUPER PURE ZERO" ...
##  $ PACKAGE        : chr  "16SMALL 4ONE CUP" "16SMALL MULTI CUP" "16SMALL MULTI CUP" "16SMALL MULTI CUP" ...
##  $ ITEM           : chr  "MYTHICAL BEVERAGE ULTRA KIWANO  ENERGY DRINK UNFLAVORED ZERO SUGAR CUP 16 LIQUID SMALL X4" "SUPER-DUPER PURE ZERO ENERGY DRINK KIWANO  KEKE  SUGAR FREE CUP 16 LIQUID SMALL" "RAINING JUMPIN-FISH GAME FUEL ZERO ENERGY DRINK CHARGED KIWANO  SHOCK ZERO SUGAR CUP 16 LIQUID SMALL" "SUPER-DUPER PURE ZERO ENERGY DRINK KIWANO  KEKE  SUGAR FREE CUP 16 LIQUID SMALL" ...
##  $ REGION         : chr  "NORTHERN" "ARIZONA" "MOUNTAIN" "COLORADO" ...
##  $ MONTH          : Factor w/ 12 levels "1","2","3","4",..: 2 8 12 2 5 5 10 8 5 8 ...
##  $ SEASON         : chr  "WINTER" "SUMMER" "WINTER" "WINTER" ...
# Assuming 'innovation' is your data frame
model <- lm(DOLLAR_SALES ~ UNIT_SALES + CALORIC_SEGMENT + PACKAGE + SEASON + REGION, data = innovation)
summary(model)
## 
## Call:
## lm(formula = DOLLAR_SALES ~ UNIT_SALES + CALORIC_SEGMENT + PACKAGE + 
##     SEASON + REGION, data = innovation)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -953.6  -35.4   -1.1   27.6 5847.8 
## 
## Coefficients: (1 not defined because of singularities)
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -3.571915  18.407162  -0.194 0.846141    
## UNIT_SALES                 2.179578   0.004342 501.957  < 2e-16 ***
## CALORIC_SEGMENT                  NA         NA      NA       NA    
## PACKAGE16SMALL 24ONE CUP 178.333322  19.021309   9.375  < 2e-16 ***
## PACKAGE16SMALL 4ONE CUP   62.481782  18.270952   3.420 0.000630 ***
## PACKAGE16SMALL MULTI CUP  -9.985916  17.796875  -0.561 0.574742    
## SEASONSPRING               7.749111   5.278459   1.468 0.142126    
## SEASONSUMMER               0.158127   5.606383   0.028 0.977500    
## SEASONWINTER              -5.957836   5.296196  -1.125 0.260653    
## REGIONCALI_NEVADA         -6.656448  10.258577  -0.649 0.516443    
## REGIONCOLORADO            19.756980   6.669432   2.962 0.003062 ** 
## REGIONDESERT_SW            0.165096   7.867662   0.021 0.983259    
## REGIONKANSAS             170.758804  14.371366  11.882  < 2e-16 ***
## REGIONMOUNTAIN            -0.897751   7.130829  -0.126 0.899816    
## REGIONNEWMEXICO           15.665066   9.744594   1.608 0.107970    
## REGIONNOCAL              -18.993746   9.856390  -1.927 0.054009 .  
## REGIONNORTHERN            -6.707730   5.438244  -1.233 0.217449    
## REGIONPRAIRIE             40.817170  11.563916   3.530 0.000418 ***
## REGIONSOCAL              -14.067039   7.502472  -1.875 0.060831 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 164.2 on 8064 degrees of freedom
## Multiple R-squared:  0.975,  Adjusted R-squared:  0.975 
## F-statistic: 1.852e+04 on 17 and 8064 DF,  p-value: < 2.2e-16
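The `NA` coefficient on CALORIC_SEGMENT is expected: the `innovation` subset was filtered to `CALORIC_SEGMENT == 0`, so the column is constant and drops out of the fit. A quick check (assumes `innovation` from the chunk above):

```r
# CALORIC_SEGMENT has a single value in this subset, so lm() cannot
# estimate a coefficient for it
table(innovation$CALORIC_SEGMENT)
```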
library(dplyr)

# Keep rows within the VENOMOUS BLAST range observed above
# (max UNIT_SALES 3298, max DOLLAR_SALES ~3200)
small_group <- df %>%
  filter(UNIT_SALES < 3300, DOLLAR_SALES < 3200)

skim(small_group)
Data summary
Name small_group
Number of rows 2372840
Number of columns 13
_______________________
Column type frequency:
character 9
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
MARKET_KEY 0 1 1 4 0 200 0
DATE 0 1 10 10 0 152 0
CATEGORY 0 1 3 18 0 5 0
MANUFACTURER 0 1 5 8 0 8 0
BRAND 0 1 4 56 0 288 0
PACKAGE 0 1 11 26 0 95 0
ITEM 0 1 26 142 0 2999 0
REGION 0 1 5 11 0 11 0
SEASON 0 1 4 6 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
CALORIC_SEGMENT 0 1 0.50 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▇
UNIT_SALES 0 1 104.83 183.49 0.04 10.00 38.00 113.00 3298.00 ▇▁▁▁▁
DOLLAR_SALES 0 1 332.69 516.65 0.01 34.92 126.27 380.55 3199.98 ▇▁▁▁▁
MONTH 0 1 6.28 3.44 1.00 3.00 6.00 9.00 12.00 ▇▆▆▅▇
# Create a scatter plot with the regression line, colored by MANUFACTURER
ggplot(small_group, aes(x = UNIT_SALES, y = DOLLAR_SALES, color = MANUFACTURER)) +
  geom_point(alpha = 0.5) +  # Adjust alpha to avoid overplotting, if necessary
  geom_smooth(method = "lm", color = "black", se = FALSE) +  # Add linear regression line without confidence band for clarity
  labs(title = "Linear Model of UNIT_SALES vs. DOLLAR_SALES by MANUFACTURER",
       x = "UNIT SALES",
       y = "DOLLAR SALES") +
  theme_minimal() +
  theme(legend.position = "bottom")  # Adjust legend position if needed
## `geom_smooth()` using formula = 'y ~ x'

kiwano_small <- df[grep("kiwano", df$ITEM, ignore.case = TRUE), ]
skim(kiwano_small)
Data summary
Name kiwano_small
Number of rows 71256
Number of columns 13
_______________________
Column type frequency:
character 9
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
MARKET_KEY 0 1 1 4 0 200 0
DATE 0 1 10 10 0 152 0
CATEGORY 0 1 3 18 0 4 0
MANUFACTURER 0 1 5 8 0 7 0
BRAND 0 1 5 41 0 27 0
PACKAGE 0 1 12 23 0 28 0
ITEM 0 1 46 105 0 68 0
REGION 0 1 5 11 0 11 0
SEASON 0 1 4 6 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
CALORIC_SEGMENT 0 1 0.34 0.48 0.00 0.00 0.0 1.00 1.00 ▇▁▁▁▅
UNIT_SALES 0 1 101.93 384.04 0.50 8.00 26.0 76.00 16851.00 ▇▁▁▁▁
DOLLAR_SALES 0 1 280.74 1016.57 0.01 28.25 86.8 221.68 45991.65 ▇▁▁▁▁
MONTH 0 1 6.32 3.44 1.00 3.00 6.0 9.00 12.00 ▇▆▆▅▇
# Assuming 'innovation' is your data frame
model <- lm(DOLLAR_SALES ~ UNIT_SALES + CALORIC_SEGMENT + PACKAGE + CATEGORY + SEASON + REGION, data = kiwano_small)
summary(model)
## 
## Call:
## lm(formula = DOLLAR_SALES ~ UNIT_SALES + CALORIC_SEGMENT + PACKAGE + 
##     CATEGORY + SEASON + REGION, data = kiwano_small)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4689.1   -40.8    -5.9    38.3  6971.0 
## 
## Coefficients:
##                                  Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                     1.393e+02  1.103e+01   12.625  < 2e-16 ***
## UNIT_SALES                      2.568e+00  1.992e-03 1288.714  < 2e-16 ***
## CALORIC_SEGMENT                 1.117e+02  2.414e+00   46.279  < 2e-16 ***
## PACKAGE.5L 6ONE JUG            -8.551e+01  6.049e+00  -14.137  < 2e-16 ***
## PACKAGE.5L MULTI JUG           -1.233e+02  1.111e+01  -11.092  < 2e-16 ***
## PACKAGE1.25L MULTI JUG         -3.811e+02  2.084e+01  -18.287  < 2e-16 ***
## PACKAGE12SMALL 12ONE CUP       -8.777e+01  9.668e+00   -9.079  < 2e-16 ***
## PACKAGE12SMALL 24ONE CUP       -2.348e+02  1.484e+01  -15.825  < 2e-16 ***
## PACKAGE12SMALL 8ONE CUP        -8.347e+01  1.062e+01   -7.857 4.00e-15 ***
## PACKAGE12SMALL MLT BUMPY CUP   -1.712e+02  7.514e+00  -22.790  < 2e-16 ***
## PACKAGE12SMALL MLT MEDIUM CUP  -1.833e+02  4.247e+01   -4.317 1.58e-05 ***
## PACKAGE12SMALL MULTI CUP       -1.020e+02  1.107e+01   -9.209  < 2e-16 ***
## PACKAGE16SMALL 12ONE CUP       -1.417e+02  2.247e+01   -6.308 2.85e-10 ***
## PACKAGE16SMALL 24ONE CUP        3.268e+01  1.338e+01    2.442 0.014617 *  
## PACKAGE16SMALL 4ONE CUP        -8.254e+01  1.206e+01   -6.842 7.86e-12 ***
## PACKAGE16SMALL MULTI CUP       -2.411e+02  1.079e+01  -22.335  < 2e-16 ***
## PACKAGE18SMALL 6ONE            -3.718e+01  5.306e+00   -7.007 2.46e-12 ***
## PACKAGE18SMALL MULTI JUG       -1.642e+02  4.463e+00  -36.784  < 2e-16 ***
## PACKAGE1L MULTI JUG            -1.883e+02  2.070e+01   -9.096  < 2e-16 ***
## PACKAGE20SMALL 12ONE JUG       -2.342e+02  6.947e+01   -3.371 0.000749 ***
## PACKAGE20SMALL MULTI JUG       -2.175e+02  4.431e+00  -49.091  < 2e-16 ***
## PACKAGE24 - 25SMALL MULTI JUG  -1.819e+02  5.587e+00  -32.557  < 2e-16 ***
## PACKAGE24SMALL MLT SHADYES JUG -1.985e+02  4.364e+01   -4.548 5.42e-06 ***
## PACKAGE2L MULTI JUG            -2.168e+02  7.749e+00  -27.981  < 2e-16 ***
## PACKAGE7.5SMALL 10ONE          -1.313e+02  1.064e+02   -1.234 0.217097    
## PACKAGE8SMALL 12ONE CUP        -3.811e+00  1.199e+01   -0.318 0.750544    
## PACKAGE8SMALL 24ONE CUP        -2.551e+02  2.122e+01  -12.022  < 2e-16 ***
## PACKAGE8SMALL 4ONE CUP         -1.213e+02  1.151e+01  -10.540  < 2e-16 ***
## PACKAGE8SMALL MULTI CUP        -3.477e+02  1.148e+01  -30.292  < 2e-16 ***
## PACKAGEALL OTHER ONES          -2.758e+01  1.107e+01   -2.490 0.012769 *  
## CATEGORYING ENHANCED WATER     -1.694e+01  1.022e+01   -1.658 0.097326 .  
## CATEGORYSPARKLING WATER        -3.909e+00  3.442e+00   -1.136 0.256005    
## CATEGORYSSD                    -6.247e+01  9.538e+00   -6.550 5.80e-11 ***
## SEASONSPRING                   -1.027e+01  1.944e+00   -5.282 1.28e-07 ***
## SEASONSUMMER                   -1.143e+01  2.017e+00   -5.667 1.46e-08 ***
## SEASONWINTER                   -7.458e+00  1.976e+00   -3.774 0.000161 ***
## REGIONCALI_NEVADA               4.919e+00  4.000e+00    1.230 0.218800    
## REGIONCOLORADO                  1.046e+01  2.480e+00    4.220 2.45e-05 ***
## REGIONDESERT_SW                 2.392e+00  2.934e+00    0.815 0.414922    
## REGIONKANSAS                    1.757e+02  5.232e+00   33.571  < 2e-16 ***
## REGIONMOUNTAIN                  1.032e+01  2.691e+00    3.835 0.000126 ***
## REGIONNEWMEXICO                 1.735e+01  3.511e+00    4.943 7.71e-07 ***
## REGIONNOCAL                     2.368e+00  3.761e+00    0.630 0.528893    
## REGIONNORTHERN                  8.657e+00  2.026e+00    4.273 1.93e-05 ***
## REGIONPRAIRIE                   2.289e+01  4.318e+00    5.301 1.15e-07 ***
## REGIONSOCAL                     1.705e+00  2.842e+00    0.600 0.548662    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 183.4 on 71210 degrees of freedom
## Multiple R-squared:  0.9675, Adjusted R-squared:  0.9675 
## F-statistic: 4.707e+04 on 45 and 71210 DF,  p-value: < 2.2e-16
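With this many dummy-coded terms, the coefficient table is easier to mine programmatically than to scan by eye. A minimal base-R sketch, shown on a stand-in `mtcars` fit since the actual model object is defined earlier in the notebook, ranks terms by absolute effect size:

```r
# Rank a fitted lm's coefficients by absolute effect size (base R only).
# `fit` is a stand-in model; substitute the real lm object from above.
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

co <- summary(fit)$coefficients   # Estimate, Std. Error, t value, Pr(>|t|)
ranked <- co[order(-abs(co[, "Estimate"])), ]
ranked
```

The same one-liner applied to the real model surfaces the largest package and region effects immediately.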

Though Kiwano and energy drinks have very few rows, I do think there is potential here to find a well-fitting model that can predict launch sales. If we can build a model that predicts unit sales of size-16, kiwano-flavored energy drinks, we can then combine it with the current sales rate of VENOMOUS BLAST launches to get an accurate forecast. As for selecting which weeks would be best to sell, I don't see any better approach than using the historical best 13 weeks of sales for VENOMOUS BLAST, energy drinks, or kiwano-flavored drinks.
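That "historical best 13 weeks" idea amounts to a rolling 13-week sum over weekly unit sales. A minimal sketch on synthetic weekly data (a stand-in for, e.g., VENOMOUS BLAST history):

```r
# Find the best-performing 13 consecutive weeks in a year of weekly sales.
# Synthetic data with a seasonal peak stands in for real unit-sales history.
set.seed(123)
weekly <- data.frame(
  week  = 1:52,
  units = 100 + 60 * sin(2 * pi * (1:52) / 52) + rnorm(52, sd = 10)
)

# Rolling 13-week totals; window i covers weeks i..(i + 12)
roll13 <- sapply(1:(52 - 12), function(i) sum(weekly$units[i:(i + 12)]))
best_start <- which.max(roll13)
best_weeks <- best_start:(best_start + 12)
cat("Best 13-week window: weeks", best_start, "through", best_start + 12, "\n")
```

Run against each candidate history (brand, category, or flavor), this gives a defensible, data-driven answer to the "which 13 weeks" question.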

XGBOOST

# Load and prepare dataset
df1 <- read.csv("one_hot_kiwano.csv") 
df1 <- df1 %>% 
  select(-DATE, -MONTH, -WINTER, -SPRING, -FALL, -DOLLAR_SALES, -SUMMER)
# Summarize the dataset
skimr::skim(df1)
Data summary
Name df1
Number of rows 8082
Number of columns 31
_______________________
Column type frequency:
numeric 31
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
UNIT_SALES 0 1 171.20 468.61 0.5 10 66 214 10621 ▇▁▁▁▁
NORTHERN 0 1 0.25 0.43 0.0 0 0 0 1 ▇▁▁▁▂
ARIZONA 0 1 0.21 0.41 0.0 0 0 0 1 ▇▁▁▁▂
MOUNTAIN 0 1 0.10 0.29 0.0 0 0 0 1 ▇▁▁▁▁
COLORADO 0 1 0.12 0.32 0.0 0 0 0 1 ▇▁▁▁▁
DESERT_SW 0 1 0.07 0.26 0.0 0 0 0 1 ▇▁▁▁▁
NOCAL 0 1 0.04 0.20 0.0 0 0 0 1 ▇▁▁▁▁
SOCAL 0 1 0.09 0.28 0.0 0 0 0 1 ▇▁▁▁▁
KANSAS 0 1 0.02 0.14 0.0 0 0 0 1 ▇▁▁▁▁
NEWMEXICO 0 1 0.04 0.20 0.0 0 0 0 1 ▇▁▁▁▁
CALI_NEVADA 0 1 0.04 0.19 0.0 0 0 0 1 ▇▁▁▁▁
PRAIRIE 0 1 0.03 0.17 0.0 0 0 0 1 ▇▁▁▁▁
MYTHICAL.BEVERAGE.ULTRA 0 1 0.55 0.50 0.0 0 1 1 1 ▆▁▁▁▇
SUPER.DUPER.PURE.ZERO 0 1 0.37 0.48 0.0 0 0 1 1 ▇▁▁▁▅
HILL.MOISTURE.JUMPIN.FISH 0 1 0.04 0.19 0.0 0 0 0 1 ▇▁▁▁▁
VENOMOUS.BLAST 0 1 0.03 0.17 0.0 0 0 0 1 ▇▁▁▁▁
POW.POW 0 1 0.01 0.10 0.0 0 0 0 1 ▇▁▁▁▁
MYTHICAL.BEVERAGE.REHAB 0 1 0.00 0.02 0.0 0 0 0 1 ▇▁▁▁▁
MYTHICAL.BEVERAGE.ULTRA.KIWANO.UNFLAVORED. 0 1 0.55 0.50 0.0 0 1 1 1 ▆▁▁▁▇
SUPER.DUPER.PURE.ZERO.KIWANO.KEKE. 0 1 0.30 0.46 0.0 0 0 1 1 ▇▁▁▁▃
RAINING.JUMPIN.FISH.GAME.FUEL.ZERO.CHARGED.KIWANO.SHOCK. 0 1 0.04 0.19 0.0 0 0 0 1 ▇▁▁▁▁
SUPER.DUPER.PURE.ZERO.KIWANO. 0 1 0.37 0.48 0.0 0 0 1 1 ▇▁▁▁▅
VENOMOUS.BLAST.KIWANO.DURIAN. 0 1 0.03 0.17 0.0 0 0 0 1 ▇▁▁▁▁
POW.POW.WYLDIN.KIWANO. 0 1 0.01 0.10 0.0 0 0 0 1 ▇▁▁▁▁
MYTHICAL.BEVERAGE.REHAB.KIWANO. 0 1 0.00 0.02 0.0 0 0 0 1 ▇▁▁▁▁
X16SMALL.4ONE.CUP 0 1 0.15 0.36 0.0 0 0 0 1 ▇▁▁▁▂
X16SMALL.MULTI.CUP 0 1 0.77 0.42 0.0 1 1 1 1 ▂▁▁▁▇
X16SMALL.24ONE.CUP 0 1 0.07 0.25 0.0 0 0 0 1 ▇▁▁▁▁
X16SMALL.12ONE.CUP 0 1 0.01 0.10 0.0 0 0 0 1 ▇▁▁▁▁
Week_Of_Year 0 1 25.13 15.37 1.0 12 24 39 53 ▇▇▆▆▆
Week_Since_Launch 0 1 65.58 40.11 1.0 29 66 99 152 ▇▆▇▆▂

After one-hot encoding, the sampled data comes to just over 8,000 rows and 31 columns (30 features plus the UNIT_SALES target).

#Remove outliers in top 1% of Unit Sales. 
df1 <- df1 %>% filter(UNIT_SALES < quantile(UNIT_SALES, 0.99))
# Split the data
set.seed(123)
df_testtrn <- initial_split(df1, prop = 0.8, strata = UNIT_SALES)
Train <- training(df_testtrn)
Test <- testing(df_testtrn)
# Prepare features and labels for XGBoost
train_features <- Train[, -which(names(Train) == "UNIT_SALES")]
train_labels <- Train$UNIT_SALES
test_features <- Test[, -which(names(Test) == "UNIT_SALES")]
test_labels <- Test$UNIT_SALES
# Convert data to DMatrix format
dtrain <- xgb.DMatrix(data = as.matrix(train_features), label = train_labels)
dtest <- xgb.DMatrix(data = as.matrix(test_features), label = test_labels)
# Define XGBoost parameters
set.seed(123)
params <- list(
  booster = "gbtree",
  objective = "reg:squarederror",
  eval_metric = "rmse",
  eta = 0.05,
  max_depth = 4,
  min_child_weight = 3,
  subsample = 0.7,
  colsample_bytree = 0.6,
  lambda = 1,
  alpha = 1
)
# Perform cross-validation to find the optimal number of boosting rounds
cv_results <- xgb.cv(
  params = params,
  data = dtrain,  
  nfold = 5,
  nrounds = 500,
  early_stopping_rounds = 10,
  metrics = "rmse",
  seed = 123
)
## [1]  train-rmse:217.705776+1.726588  test-rmse:217.646442+6.056629 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [2]  train-rmse:210.405385+1.852384  test-rmse:210.425573+6.148944 
## [3]  train-rmse:203.574471+1.951320  test-rmse:203.611020+5.928709 
## [4]  train-rmse:196.898200+1.718560  test-rmse:196.906546+6.172606 
## [5]  train-rmse:190.845352+1.592508  test-rmse:190.903124+6.130116 
## [6]  train-rmse:184.874631+1.561967  test-rmse:184.974468+6.172219 
## [7]  train-rmse:179.604110+1.212913  test-rmse:179.705730+6.458494 
## [8]  train-rmse:174.356396+1.188794  test-rmse:174.468449+6.500571 
## [9]  train-rmse:169.601819+0.902011  test-rmse:169.747975+6.774740 
## [10] train-rmse:165.223534+1.017305  test-rmse:165.398048+6.702410 
## [11] train-rmse:161.041804+0.800571  test-rmse:161.187727+6.906616 
## [12] train-rmse:157.075138+0.741441  test-rmse:157.214980+6.924278 
## [13] train-rmse:153.672930+0.970396  test-rmse:153.905586+6.851054 
## [14] train-rmse:150.488336+1.236060  test-rmse:150.717887+6.443776 
## [15] train-rmse:147.429706+1.486844  test-rmse:147.678809+6.255446 
## [16] train-rmse:144.559867+1.666008  test-rmse:144.814005+6.056320 
## [17] train-rmse:141.878498+1.542848  test-rmse:142.137111+6.138557 
## [18] train-rmse:139.568967+1.705662  test-rmse:139.851989+5.967131 
## [19] train-rmse:137.192000+1.639567  test-rmse:137.577392+6.046113 
## [20] train-rmse:135.116385+1.691470  test-rmse:135.514617+5.985587 
## [21] train-rmse:133.312662+1.609342  test-rmse:133.749912+6.113063 
## [22] train-rmse:131.748143+1.331677  test-rmse:132.238788+6.382739 
## [23] train-rmse:130.185531+1.383939  test-rmse:130.667299+6.364131 
## [24] train-rmse:128.612633+1.166141  test-rmse:129.135985+6.517994 
## [25] train-rmse:126.906077+1.097951  test-rmse:127.467588+6.515756 
## [26] train-rmse:125.594452+1.040191  test-rmse:126.168142+6.481015 
## [27] train-rmse:124.185118+0.992398  test-rmse:124.745337+6.427232 
## [28] train-rmse:123.005509+1.058536  test-rmse:123.608229+6.286645 
## [29] train-rmse:122.040433+1.143680  test-rmse:122.627676+6.172173 
## [30] train-rmse:121.084636+1.184286  test-rmse:121.701602+6.109215 
## [31] train-rmse:120.010516+1.146676  test-rmse:120.644412+6.075442 
## [32] train-rmse:119.138369+1.041651  test-rmse:119.811827+6.130964 
## [33] train-rmse:118.270005+1.038676  test-rmse:118.975096+6.114711 
## [34] train-rmse:117.406429+1.042121  test-rmse:118.130063+6.062164 
## [35] train-rmse:116.671483+1.066339  test-rmse:117.424609+5.991489 
## [36] train-rmse:116.038411+1.108969  test-rmse:116.847059+5.938512 
## [37] train-rmse:115.411318+1.189561  test-rmse:116.262070+5.836692 
## [38] train-rmse:114.769714+1.182860  test-rmse:115.650483+5.787148 
## [39] train-rmse:114.233882+1.213768  test-rmse:115.144265+5.735848 
## [40] train-rmse:113.711103+1.180795  test-rmse:114.665068+5.717172 
## [41] train-rmse:113.237289+1.181022  test-rmse:114.224130+5.682975 
## [42] train-rmse:112.786174+1.164327  test-rmse:113.806499+5.647998 
## [43] train-rmse:112.358566+1.215843  test-rmse:113.381558+5.556097 
## [44] train-rmse:111.908613+1.207228  test-rmse:112.990934+5.509302 
## [45] train-rmse:111.505880+1.227312  test-rmse:112.609532+5.480405 
## [46] train-rmse:111.159853+1.227209  test-rmse:112.269001+5.449478 
## [47] train-rmse:110.837257+1.264560  test-rmse:111.953361+5.369060 
## [48] train-rmse:110.496226+1.228425  test-rmse:111.648858+5.336585 
## [49] train-rmse:110.207935+1.247835  test-rmse:111.388688+5.275763 
## [50] train-rmse:109.896859+1.230994  test-rmse:111.095440+5.250226 
## [51] train-rmse:109.630616+1.209032  test-rmse:110.873257+5.226998 
## [52] train-rmse:109.395453+1.171001  test-rmse:110.681952+5.240345 
## [53] train-rmse:109.178581+1.170910  test-rmse:110.500511+5.215448 
## [54] train-rmse:108.959216+1.204725  test-rmse:110.270525+5.124835 
## [55] train-rmse:108.736562+1.189279  test-rmse:110.089529+5.092821 
## [56] train-rmse:108.520950+1.191878  test-rmse:109.892245+5.050693 
## [57] train-rmse:108.303594+1.197308  test-rmse:109.679085+4.974027 
## [58] train-rmse:108.123674+1.204976  test-rmse:109.524332+4.918725 
## [59] train-rmse:107.949782+1.195668  test-rmse:109.372370+4.918172 
## [60] train-rmse:107.814083+1.201812  test-rmse:109.260263+4.914265 
## [61] train-rmse:107.659414+1.179929  test-rmse:109.130108+4.895703 
## [62] train-rmse:107.488704+1.185633  test-rmse:108.996983+4.862993 
## [63] train-rmse:107.354358+1.166640  test-rmse:108.886217+4.836369 
## [64] train-rmse:107.201045+1.154677  test-rmse:108.764455+4.809095 
## [65] train-rmse:107.067648+1.166265  test-rmse:108.671354+4.796509 
## [66] train-rmse:106.926545+1.171891  test-rmse:108.536175+4.749954 
## [67] train-rmse:106.833710+1.169548  test-rmse:108.454146+4.742443 
## [68] train-rmse:106.706328+1.163537  test-rmse:108.348051+4.742568 
## [69] train-rmse:106.583383+1.157084  test-rmse:108.252509+4.702418 
## [70] train-rmse:106.450339+1.157228  test-rmse:108.159475+4.713849 
## [71] train-rmse:106.315701+1.156993  test-rmse:108.063895+4.699920 
## [72] train-rmse:106.194928+1.151865  test-rmse:108.002515+4.657357 
## [73] train-rmse:106.089994+1.126558  test-rmse:107.910730+4.674931 
## [74] train-rmse:105.985273+1.117598  test-rmse:107.833325+4.671763 
## [75] train-rmse:105.872339+1.126851  test-rmse:107.722153+4.639997 
## [76] train-rmse:105.768401+1.150273  test-rmse:107.632913+4.610496 
## [77] train-rmse:105.684353+1.142778  test-rmse:107.559957+4.621758 
## [78] train-rmse:105.606365+1.132159  test-rmse:107.485752+4.635423 
## [79] train-rmse:105.528355+1.137745  test-rmse:107.431489+4.613220 
## [80] train-rmse:105.439508+1.130636  test-rmse:107.371627+4.596848 
## [81] train-rmse:105.349802+1.114822  test-rmse:107.302164+4.583688 
## [82] train-rmse:105.242874+1.126858  test-rmse:107.224359+4.578922 
## [83] train-rmse:105.144347+1.113740  test-rmse:107.145975+4.598302 
## [84] train-rmse:105.042366+1.097820  test-rmse:107.087537+4.563330 
## [85] train-rmse:104.979591+1.097749  test-rmse:107.042504+4.571220 
## [86] train-rmse:104.908312+1.104987  test-rmse:106.998774+4.542830 
## [87] train-rmse:104.835939+1.109647  test-rmse:106.959278+4.531036 
## [88] train-rmse:104.743684+1.110854  test-rmse:106.906384+4.511412 
## [89] train-rmse:104.675038+1.115236  test-rmse:106.845715+4.473053 
## [90] train-rmse:104.610710+1.118805  test-rmse:106.818077+4.468535 
## [91] train-rmse:104.541904+1.098816  test-rmse:106.785860+4.479545 
## [92] train-rmse:104.475743+1.112554  test-rmse:106.729978+4.438968 
## [93] train-rmse:104.389205+1.107802  test-rmse:106.672804+4.440723 
## [94] train-rmse:104.336561+1.106887  test-rmse:106.644459+4.434061 
## [95] train-rmse:104.280621+1.103414  test-rmse:106.589836+4.422531 
## [96] train-rmse:104.209040+1.096128  test-rmse:106.566293+4.423767 
## [97] train-rmse:104.139996+1.090528  test-rmse:106.518266+4.410642 
## [98] train-rmse:104.075481+1.072181  test-rmse:106.473619+4.408489 
## [99] train-rmse:104.032712+1.069677  test-rmse:106.446107+4.401632 
## [100]    train-rmse:103.960357+1.063107  test-rmse:106.409175+4.387759 
## [101]    train-rmse:103.903838+1.053711  test-rmse:106.377790+4.382538 
## [102]    train-rmse:103.855284+1.049574  test-rmse:106.347117+4.395553 
## [103]    train-rmse:103.785864+1.074153  test-rmse:106.293021+4.366604 
## [104]    train-rmse:103.725926+1.066042  test-rmse:106.251507+4.371974 
## [105]    train-rmse:103.671380+1.075536  test-rmse:106.222887+4.357470 
## [106]    train-rmse:103.624556+1.078243  test-rmse:106.215842+4.350971 
## [107]    train-rmse:103.563840+1.084098  test-rmse:106.192279+4.335865 
## [108]    train-rmse:103.502446+1.102615  test-rmse:106.150729+4.294037 
## [109]    train-rmse:103.458180+1.098972  test-rmse:106.117737+4.304934 
## [110]    train-rmse:103.411051+1.085172  test-rmse:106.103294+4.309691 
## [111]    train-rmse:103.370825+1.085410  test-rmse:106.096305+4.314054 
## [112]    train-rmse:103.317972+1.083958  test-rmse:106.065302+4.317350 
## [113]    train-rmse:103.262625+1.075745  test-rmse:106.029728+4.318751 
## [114]    train-rmse:103.225212+1.066823  test-rmse:106.003426+4.313490 
## [115]    train-rmse:103.164468+1.062869  test-rmse:105.981811+4.295438 
## [116]    train-rmse:103.106271+1.057822  test-rmse:105.961667+4.322044 
## [117]    train-rmse:103.054937+1.055968  test-rmse:105.940956+4.310770 
## [118]    train-rmse:103.015156+1.055820  test-rmse:105.915448+4.287661 
## [119]    train-rmse:102.955009+1.051020  test-rmse:105.901180+4.273945 
## [120]    train-rmse:102.906140+1.046282  test-rmse:105.865737+4.266017 
## [121]    train-rmse:102.868256+1.057312  test-rmse:105.841330+4.246234 
## [122]    train-rmse:102.813576+1.060385  test-rmse:105.814376+4.232707 
## [123]    train-rmse:102.757229+1.088211  test-rmse:105.791383+4.212717 
## [124]    train-rmse:102.699640+1.088199  test-rmse:105.759068+4.198639 
## [125]    train-rmse:102.662515+1.094546  test-rmse:105.758886+4.200814 
## [126]    train-rmse:102.590459+1.101506  test-rmse:105.719216+4.208274 
## [127]    train-rmse:102.537992+1.107099  test-rmse:105.704954+4.207722 
## [128]    train-rmse:102.501141+1.104981  test-rmse:105.683540+4.210462 
## [129]    train-rmse:102.454916+1.100548  test-rmse:105.649931+4.212549 
## [130]    train-rmse:102.420911+1.095504  test-rmse:105.641604+4.215102 
## [131]    train-rmse:102.382171+1.093489  test-rmse:105.632402+4.220375 
## [132]    train-rmse:102.343972+1.087291  test-rmse:105.615151+4.221160 
## [133]    train-rmse:102.301734+1.094049  test-rmse:105.605219+4.204597 
## [134]    train-rmse:102.259216+1.088933  test-rmse:105.588173+4.177427 
## [135]    train-rmse:102.197632+1.090465  test-rmse:105.553342+4.164299 
## [136]    train-rmse:102.159578+1.092124  test-rmse:105.563892+4.158929 
## [137]    train-rmse:102.117678+1.091260  test-rmse:105.535869+4.139190 
## [138]    train-rmse:102.073379+1.088426  test-rmse:105.513628+4.137416 
## [139]    train-rmse:102.044361+1.084385  test-rmse:105.491182+4.136099 
## [140]    train-rmse:101.978440+1.056975  test-rmse:105.450249+4.160764 
## [141]    train-rmse:101.949579+1.054336  test-rmse:105.447676+4.149661 
## [142]    train-rmse:101.899963+1.038619  test-rmse:105.408220+4.150925 
## [143]    train-rmse:101.867287+1.033809  test-rmse:105.382365+4.158884 
## [144]    train-rmse:101.834973+1.028955  test-rmse:105.376583+4.153997 
## [145]    train-rmse:101.800781+1.020013  test-rmse:105.355388+4.163804 
## [146]    train-rmse:101.761408+1.029571  test-rmse:105.331032+4.163345 
## [147]    train-rmse:101.732601+1.028802  test-rmse:105.324131+4.156258 
## [148]    train-rmse:101.695686+1.030180  test-rmse:105.332667+4.146834 
## [149]    train-rmse:101.663309+1.025870  test-rmse:105.310385+4.144242 
## [150]    train-rmse:101.631562+1.024428  test-rmse:105.309080+4.151472 
## [151]    train-rmse:101.614917+1.024407  test-rmse:105.311228+4.140177 
## [152]    train-rmse:101.580140+1.022296  test-rmse:105.307063+4.137872 
## [153]    train-rmse:101.551503+1.012723  test-rmse:105.303371+4.157111 
## [154]    train-rmse:101.491454+1.019352  test-rmse:105.287224+4.130295 
## [155]    train-rmse:101.454370+1.010309  test-rmse:105.266561+4.128269 
## [156]    train-rmse:101.426843+1.005533  test-rmse:105.275174+4.119941 
## [157]    train-rmse:101.402391+1.005779  test-rmse:105.268038+4.112996 
## [158]    train-rmse:101.364443+1.000622  test-rmse:105.244424+4.105382 
## [159]    train-rmse:101.325293+1.010992  test-rmse:105.235812+4.113047 
## [160]    train-rmse:101.297977+1.012798  test-rmse:105.234310+4.098510 
## [161]    train-rmse:101.269076+1.007082  test-rmse:105.232939+4.097727 
## [162]    train-rmse:101.239589+1.000364  test-rmse:105.231916+4.094354 
## [163]    train-rmse:101.206147+1.005057  test-rmse:105.220184+4.082585 
## [164]    train-rmse:101.178684+1.003359  test-rmse:105.204772+4.085032 
## [165]    train-rmse:101.155662+0.993086  test-rmse:105.182076+4.090590 
## [166]    train-rmse:101.117763+0.999881  test-rmse:105.164226+4.066201 
## [167]    train-rmse:101.092743+0.988887  test-rmse:105.155499+4.066326 
## [168]    train-rmse:101.065961+0.977494  test-rmse:105.162879+4.054113 
## [169]    train-rmse:101.035898+0.983972  test-rmse:105.166591+4.047285 
## [170]    train-rmse:101.008053+0.985241  test-rmse:105.148481+4.033395 
## [171]    train-rmse:100.962330+0.986424  test-rmse:105.147480+4.043883 
## [172]    train-rmse:100.933215+0.981249  test-rmse:105.140726+4.031063 
## [173]    train-rmse:100.901074+0.988444  test-rmse:105.140361+4.006925 
## [174]    train-rmse:100.875753+0.986333  test-rmse:105.130550+3.999878 
## [175]    train-rmse:100.849817+0.983349  test-rmse:105.116837+3.986140 
## [176]    train-rmse:100.817544+0.983382  test-rmse:105.112570+3.985896 
## [177]    train-rmse:100.795199+0.981469  test-rmse:105.098115+3.986405 
## [178]    train-rmse:100.766856+0.982664  test-rmse:105.087798+4.001022 
## [179]    train-rmse:100.750584+0.985351  test-rmse:105.083388+3.996594 
## [180]    train-rmse:100.722144+0.989597  test-rmse:105.091095+3.996774 
## [181]    train-rmse:100.687992+0.990381  test-rmse:105.094837+3.994948 
## [182]    train-rmse:100.656229+0.994809  test-rmse:105.108010+3.994613 
## [183]    train-rmse:100.628636+0.994684  test-rmse:105.100448+4.003163 
## [184]    train-rmse:100.603603+1.004044  test-rmse:105.089808+3.997248 
## [185]    train-rmse:100.571167+1.017227  test-rmse:105.082038+4.000785 
## [186]    train-rmse:100.533987+1.019749  test-rmse:105.083299+3.990631 
## [187]    train-rmse:100.513657+1.015290  test-rmse:105.085105+3.988404 
## [188]    train-rmse:100.486879+1.024148  test-rmse:105.067510+3.984495 
## [189]    train-rmse:100.464710+1.019841  test-rmse:105.052180+3.985632 
## [190]    train-rmse:100.445860+1.017233  test-rmse:105.050569+3.978194 
## [191]    train-rmse:100.412713+1.023659  test-rmse:105.026976+3.960303 
## [192]    train-rmse:100.378254+1.017109  test-rmse:105.025809+3.964908 
## [193]    train-rmse:100.348592+1.009292  test-rmse:105.022782+3.966474 
## [194]    train-rmse:100.320556+1.004093  test-rmse:105.022983+3.963966 
## [195]    train-rmse:100.300810+0.999951  test-rmse:105.019926+3.964219 
## [196]    train-rmse:100.276388+0.995213  test-rmse:105.019174+3.958770 
## [197]    train-rmse:100.244758+0.991710  test-rmse:105.013163+3.953816 
## [198]    train-rmse:100.218024+0.979115  test-rmse:105.017656+3.940724 
## [199]    train-rmse:100.192420+0.978570  test-rmse:105.022365+3.924850 
## [200]    train-rmse:100.175061+0.979360  test-rmse:105.016225+3.905252 
## [201]    train-rmse:100.141860+0.979913  test-rmse:105.011539+3.893110 
## [202]    train-rmse:100.119294+0.980655  test-rmse:105.021615+3.889566 
## [203]    train-rmse:100.087983+0.980008  test-rmse:105.013679+3.879009 
## [204]    train-rmse:100.067846+0.979039  test-rmse:105.002239+3.896566 
## [205]    train-rmse:100.046014+0.983494  test-rmse:105.009523+3.894700 
## [206]    train-rmse:100.010933+0.982794  test-rmse:104.999094+3.893370 
## [207]    train-rmse:99.985484+0.971739   test-rmse:105.006011+3.900278 
## [208]    train-rmse:99.968869+0.969234   test-rmse:104.990949+3.906836 
## [209]    train-rmse:99.938276+0.978956   test-rmse:104.979354+3.891847 
## [210]    train-rmse:99.916563+0.985604   test-rmse:104.981912+3.879570 
## [211]    train-rmse:99.900399+0.984325   test-rmse:104.977073+3.881675 
## [212]    train-rmse:99.880024+0.988301   test-rmse:104.962749+3.863712 
## [213]    train-rmse:99.856189+0.995239   test-rmse:104.972623+3.863823 
## [214]    train-rmse:99.831703+0.999802   test-rmse:104.974014+3.861844 
## [215]    train-rmse:99.808883+0.999512   test-rmse:104.975537+3.858152 
## [216]    train-rmse:99.790881+1.004848   test-rmse:104.962918+3.846758 
## [217]    train-rmse:99.772378+1.007723   test-rmse:104.946749+3.836517 
## [218]    train-rmse:99.755767+1.005294   test-rmse:104.944672+3.835519 
## [219]    train-rmse:99.734916+1.006098   test-rmse:104.948898+3.829234 
## [220]    train-rmse:99.716835+1.003819   test-rmse:104.953244+3.831615 
## [221]    train-rmse:99.681587+0.986968   test-rmse:104.952624+3.828575 
## [222]    train-rmse:99.664689+0.987398   test-rmse:104.936489+3.825513 
## [223]    train-rmse:99.641951+0.984801   test-rmse:104.946943+3.812131 
## [224]    train-rmse:99.620762+0.989544   test-rmse:104.945522+3.809201 
## [225]    train-rmse:99.608931+0.987994   test-rmse:104.947984+3.809749 
## [226]    train-rmse:99.594940+0.993725   test-rmse:104.932484+3.812721 
## [227]    train-rmse:99.572979+0.998810   test-rmse:104.940761+3.811802 
## [228]    train-rmse:99.551276+0.996702   test-rmse:104.945003+3.814768 
## [229]    train-rmse:99.534661+0.993734   test-rmse:104.947652+3.809602 
## [230]    train-rmse:99.518743+0.986978   test-rmse:104.945441+3.815927 
## [231]    train-rmse:99.506661+0.987994   test-rmse:104.949409+3.808613 
## [232]    train-rmse:99.483904+0.979884   test-rmse:104.939976+3.796890 
## [233]    train-rmse:99.458698+0.970649   test-rmse:104.941916+3.793822 
## [234]    train-rmse:99.441665+0.971398   test-rmse:104.955933+3.800672 
## [235]    train-rmse:99.429076+0.968957   test-rmse:104.952240+3.814654 
## [236]    train-rmse:99.398246+0.967131   test-rmse:104.948200+3.811817 
## Stopping. Best iteration:
## [226]    train-rmse:99.594940+0.993725   test-rmse:104.932484+3.812721
best_nrounds <- cv_results$best_iteration
# Train the final model using the best number of rounds found
model_xgb <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = best_nrounds
)
# Make predictions and evaluate the model
train_pred <- predict(model_xgb, dtrain)
test_pred <- predict(model_xgb, dtest)
train_rmse <- sqrt(mean((train_labels - train_pred)^2))
test_rmse <- sqrt(mean((test_labels - test_pred)^2))
# Calculate R-squared for the training set
sst_train <- sum((train_labels - mean(train_labels)) ^ 2)
ssr_train <- sum((train_labels - train_pred) ^ 2)
r_squared_train <- 1 - (ssr_train / sst_train)
# Calculate R-squared for the test set
sst_test <- sum((test_labels - mean(test_labels)) ^ 2)
ssr_test <- sum((test_labels - test_pred) ^ 2)
r_squared_test <- 1 - (ssr_test / sst_test)
train_mape <- mean(abs((train_labels - train_pred) / train_labels)) * 100
test_mape <- mean(abs((test_labels - test_pred) / test_labels)) * 100
train_mae <- mean(abs(train_labels - train_pred))
test_mae <- mean(abs(test_labels - test_pred))
cat("Model Performance Metrics:\n",
    "--------------------------\n",
    "Training RMSE: ", train_rmse, "\n",
    "Test RMSE: ", test_rmse, "\n",
    "Training R-squared: ", r_squared_train, "\n",
    "Test R-squared: ", r_squared_test, "\n",
    "Training MAE: ", train_mae, "\n",
    "Test MAE: ", test_mae, "\n",
    "Training MAPE: ", train_mape, "%\n",
    "Test MAPE: ", test_mape, "%\n", sep="")
## Model Performance Metrics:
## --------------------------
## Training RMSE: 100.2079
## Test RMSE: 109.262
## Training R-squared: 0.6840278
## Test R-squared: 0.6283978
## Training MAE: 58.94409
## Test MAE: 62.40066
## Training MAPE: 231.2208%
## Test MAPE: 223.7863%

For the Kiwano Energy model, our training RMSE is 100.21 and test RMSE is 109.26. Some drop-off from train to test is expected, though the gap is worth watching for slight overfitting. The R-squared values are both moderate, at 0.68 for training and 0.63 for testing, indicating the model explains some but not all of the variance. MAE is also low, with no significant difference between training (58.9) and test (62.4). The last metric, MAPE, is high on both sets (231% training, 224% test); this is largely driven by rows with very small actual unit sales, where even modest absolute errors produce enormous percentage errors. Overall this model does show some predictive power, but with more features we may be able to strengthen it.
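Because MAPE divides by the actual value, rows selling fractions of a case dominate the average. A volume-weighted MAPE (total absolute error over total actual volume) is often a fairer read for skewed demand data; a small sketch illustrating the difference on made-up numbers:

```r
# MAPE vs. weighted MAPE on toy demand data with one near-zero actual.
actual <- c(0.5, 2, 10, 150, 900)
pred   <- c(5,   4, 12, 140, 850)

mape  <- mean(abs((actual - pred) / actual)) * 100   # blown up by the 0.5 row
wmape <- sum(abs(actual - pred)) / sum(actual) * 100 # weighted by volume
cat(sprintf("MAPE: %.1f%%  WMAPE: %.1f%%\n", mape, wmape))
```

Reporting WMAPE alongside MAPE for this model would show whether the 224% figure reflects genuine error or just low-volume rows.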

# Calculate feature importance
importance_matrix2 <- xgb.importance(feature_names = colnames(train_features), model = model_xgb)
# View the feature importance scores
print(importance_matrix2)
##                                                      Feature         Gain
##  1:                                  MYTHICAL.BEVERAGE.ULTRA 1.976594e-01
##  2:                                       X16SMALL.MULTI.CUP 1.613989e-01
##  3:               MYTHICAL.BEVERAGE.ULTRA.KIWANO.UNFLAVORED. 1.477780e-01
##  4:                                        X16SMALL.4ONE.CUP 8.659552e-02
##  5:                                        Week_Since_Launch 8.278842e-02
##  6:                                       X16SMALL.24ONE.CUP 8.058614e-02
##  7:                                    SUPER.DUPER.PURE.ZERO 4.150738e-02
##  8:                                                  PRAIRIE 3.174524e-02
##  9:                                                 COLORADO 2.806605e-02
## 10:                       SUPER.DUPER.PURE.ZERO.KIWANO.KEKE. 2.458380e-02
## 11:                                                   KANSAS 1.890134e-02
## 12:                                                    NOCAL 1.401292e-02
## 13:                                HILL.MOISTURE.JUMPIN.FISH 1.366497e-02
## 14:                                             Week_Of_Year 1.297703e-02
## 15:                                                 NORTHERN 8.778455e-03
## 16:                                           VENOMOUS.BLAST 8.147177e-03
## 17:                            SUPER.DUPER.PURE.ZERO.KIWANO. 7.398521e-03
## 18:                                                  ARIZONA 6.004959e-03
## 19:                                              CALI_NEVADA 5.905601e-03
## 20:                                                    SOCAL 5.748893e-03
## 21:                                                DESERT_SW 3.905222e-03
## 22: RAINING.JUMPIN.FISH.GAME.FUEL.ZERO.CHARGED.KIWANO.SHOCK. 3.858922e-03
## 23:                                                 MOUNTAIN 3.597554e-03
## 24:                                                NEWMEXICO 2.997507e-03
## 25:                            VENOMOUS.BLAST.KIWANO.DURIAN. 1.372027e-03
## 26:                                                  POW.POW 1.096999e-05
## 27:                                   POW.POW.WYLDIN.KIWANO. 7.459850e-06
## 28:                                       X16SMALL.12ONE.CUP 1.628477e-06
##                                                      Feature         Gain
##            Cover    Frequency
##  1: 5.625228e-02 0.0738948048
##  2: 7.601545e-02 0.0580832527
##  3: 3.900313e-02 0.0319457890
##  4: 1.227161e-02 0.0293643111
##  5: 2.398665e-01 0.2362052275
##  6: 1.117101e-02 0.0183930300
##  7: 1.440127e-02 0.0225879316
##  8: 4.779081e-02 0.0319457890
##  9: 3.974239e-02 0.0325911584
## 10: 3.073693e-02 0.0393675379
## 11: 5.117822e-02 0.0454985479
## 12: 3.887320e-02 0.0238786705
## 13: 2.396210e-03 0.0054856405
## 14: 1.123534e-01 0.1613423685
## 15: 2.704485e-02 0.0325911584
## 16: 9.647209e-03 0.0093578574
## 17: 3.921251e-03 0.0054856405
## 18: 2.087564e-02 0.0242013553
## 19: 4.056876e-02 0.0190383995
## 20: 3.233374e-02 0.0209745079
## 21: 3.179025e-02 0.0190383995
## 22: 1.894793e-03 0.0032268474
## 23: 3.500368e-02 0.0245240400
## 24: 2.208191e-02 0.0200064537
## 25: 2.086846e-03 0.0048402711
## 26: 3.756917e-04 0.0038722168
## 27: 3.128289e-04 0.0019361084
## 28: 1.014714e-05 0.0003226847
##            Cover    Frequency
xgb.plot.importance(importance_matrix = importance_matrix2)

From this importance matrix we see that brand and package size are the two biggest contributors to our model. We also see that the engineered feature Week_Since_Launch plays a large part in forming predictions.
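Gain-based importance is global; to explain why a single forecast is high or low, xgboost can also return per-row feature contributions via `predict(..., predcontrib = TRUE)`, which sum (with a bias term) to each prediction. A minimal sketch on synthetic data, not the notebook's model:

```r
# Per-prediction feature contributions (SHAP-style) from an xgboost model.
# Synthetic data; apply the same call to model_xgb and dtest to explain
# individual kiwano forecasts.
library(xgboost)
set.seed(123)
X <- matrix(rnorm(500 * 3), ncol = 3,
            dimnames = list(NULL, c("brand", "size", "weeks")))
y <- 2 * X[, "brand"] + rnorm(500, sd = 0.1)

bst <- xgboost(data = X, label = y, nrounds = 30,
               objective = "reg:squarederror", verbose = 0)

# One column per feature plus a BIAS column; rows sum to the prediction
contrib <- predict(bst, X, predcontrib = TRUE)
head(contrib, 2)
```

This would let us tell the client not just that brand matters overall, but how much each attribute moves a specific product's forecast.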

NEXT STEPS & FINAL THOUGHTS

Our next steps, in preparation for presenting to the client, include revisiting the critical variables and insights surfaced during this model submission. That means deliberate forecasting and time-series modeling on real datasets representing the products any new innovation would hope to emulate in the market, along with prescriptive analytics around demand. The possible tasks are nearly endless, so we will prioritize those that most directly answer the client's two product questions.
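The time-series step can be prototyped with base R before reaching for heavier tooling; for example, Holt-Winters on weekly unit sales with a 13-week horizon (synthetic seasonal data stands in for a real product's history):

```r
# Prototype: Holt-Winters forecast of weekly unit sales, 13 weeks ahead.
# Three years of synthetic weekly data with annual seasonality (freq = 52).
set.seed(123)
units <- ts(200 + 50 * sin(2 * pi * (1:156) / 52) + rnorm(156, sd = 15),
            frequency = 52)

hw <- HoltWinters(units)
fc <- predict(hw, n.ahead = 13, prediction.interval = TRUE)
head(fc, 3)   # columns: fit (point forecast), upr, lwr
```

The prediction intervals matter as much as the point forecast here, since over- and under-production carry different costs for Swire.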

TEAM PARTICIPATION

All team members contributed equally to the creation of this notebook. Each member took a close look at each of the questions posed by the client, Swire, and we debated which ones presented an evident way forward. Our team is aware that this project is barely half over, and the submission of this notebook constitutes the basis on which we will substantiate claims made to the client.